TY - JOUR
T1 - Robust prosody modeling for synthetic speech detection
AU - Cohen, Ariel
AU - Shyrman, Denis
AU - Solonskyi, Aleksandr
AU - Frenkel, Roman
AU - Krishtul, Arkady
AU - Gal, Oren
PY - 2025/10
Y1 - 2025/10
N2 - This paper presents a comprehensive study on developing and implementing a speech prosody extractor to enhance audio security in Automatic Speaker Verification (ASV) systems. Our novel training approach, which operates without exposure to spoofing examples, significantly improves the modeling of essential prosodic elements often overlooked in deep fake attacks. By integrating codec and recording device embeddings, the prosody extractor effectively neutralizes codec-specific distortions, enhancing robustness across various audio transmission channels. Combined with state-of-the-art ASV systems, our prosody extractor reduces the Equal Error Rate (EER) by an average of 49.15% without codecs, 50.53% with the g711 codec, 44.77% with the g729 codec, 43.43% with the Vonage (https://www.vonage.com/) channel, 42.05% with ECAPA-TDNN, and 45.17% with TitaNet across diverse datasets, including high-quality commercial deep fakes (https://elevenlabs.io/, https://play.ht/voice-cloning/). This integration markedly improves the detection and mitigation of sophisticated spoofing attempts, especially in compressed or altered audio environments. Our methodology also eliminates the dependency on textual data during training, enabling the use of larger and more varied datasets.
AB - This paper presents a comprehensive study on developing and implementing a speech prosody extractor to enhance audio security in Automatic Speaker Verification (ASV) systems. Our novel training approach, which operates without exposure to spoofing examples, significantly improves the modeling of essential prosodic elements often overlooked in deep fake attacks. By integrating codec and recording device embeddings, the prosody extractor effectively neutralizes codec-specific distortions, enhancing robustness across various audio transmission channels. Combined with state-of-the-art ASV systems, our prosody extractor reduces the Equal Error Rate (EER) by an average of 49.15% without codecs, 50.53% with the g711 codec, 44.77% with the g729 codec, 43.43% with the Vonage (https://www.vonage.com/) channel, 42.05% with ECAPA-TDNN, and 45.17% with TitaNet across diverse datasets, including high-quality commercial deep fakes (https://elevenlabs.io/, https://play.ht/voice-cloning/). This integration markedly improves the detection and mitigation of sophisticated spoofing attempts, especially in compressed or altered audio environments. Our methodology also eliminates the dependency on textual data during training, enabling the use of larger and more varied datasets.
KW - Anti-spoofing
KW - Prosody
KW - Speaker-verification
KW - Deep fake
KW - Synthetic speech detection
U2 - 10.1016/j.specom.2025.103283
DO - 10.1016/j.specom.2025.103283
M3 - Article
SN - 0167-6393
VL - 174
SP - 103283
JO - Speech Communication
JF - Speech Communication
ER -