Robust prosody modeling for synthetic speech detection

Ariel Cohen, Denis Shyrman, Aleksandr Solonskyi, Roman Frenkel, Arkady Krishtul, Oren Gal

Research output: Contribution to journal › Article › peer-review

Abstract

This paper presents a comprehensive study on developing and implementing a speech prosody extractor to enhance audio security in Automatic Speaker Verification (ASV) systems. Our novel training approach, which operates without exposure to spoofing examples, significantly improves the modeling of essential prosodic elements often overlooked in deep fake attacks. By integrating codec and recording device embeddings, the prosody extractor effectively neutralizes codec-specific distortions, enhancing robustness across various audio transmission channels. Combined with state-of-the-art ASV systems, our prosody extractor reduces the Equal Error Rate (EER) by an average of 49.15% without codecs, 50.53% with the G.711 codec, 44.77% with the G.729 codec, 43.43% with the Vonage (https://www.vonage.com/) channel, 42.05% with ECAPA-TDNN, and 45.17% with TitaNet across diverse datasets, including high-quality commercial deep fakes (e.g., https://elevenlabs.io/, https://play.ht/voice-cloning/). This integration markedly improves the detection and mitigation of sophisticated spoofing attempts, especially in compressed or altered audio environments. Our methodology also eliminates the dependency on textual data during training, enabling the use of larger and more varied datasets.
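As a point of reference for the headline numbers above, the Equal Error Rate is the operating point at which the false-acceptance and false-rejection rates coincide. The sketch below shows the standard way to estimate it from verification scores; it is a generic illustration, not the authors' evaluation code, and the function name and score arrays are hypothetical.

```python
import numpy as np

def compute_eer(genuine, impostor):
    """Estimate the Equal Error Rate (EER) from raw verification scores.

    genuine  -- scores for true-speaker (bona fide) trials
    impostor -- scores for impostor (or spoofed) trials
    Higher scores are assumed to indicate acceptance.
    """
    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False acceptance: impostor scores at or above the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # False rejection: genuine scores below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # EER is taken where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Illustrative scores only (not from the paper):
genuine = np.array([0.9, 0.8, 0.7, 0.95])
impostor = np.array([0.1, 0.2, 0.3, 0.4])
print(compute_eer(genuine, impostor))  # perfectly separated scores give 0.0
```

The percentages reported in the abstract are relative reductions of this metric, e.g. a baseline EER of 4.0% falling to roughly 2.0% corresponds to the ~50% reduction cited for the G.711 condition.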
Original language: English
Pages (from-to): 103283
Journal: Speech Communication
Volume: 174
DOIs
State: Published - Oct 2025

Keywords

  • Anti-spoofing
  • Prosody
  • Speaker-verification
  • Deep fake
  • Synthetic speech detection
