Abstract
Deepfake speech detection presents a growing challenge as generative audio technologies continue to advance. We propose a hybrid training framework that advances detection performance through novel augmentation strategies. First, we introduce a dual-stage masking approach that operates both at the waveform level via time–frequency masking (MaskedSpec) and within the latent feature space (MaskedFeature), providing complementary regularization that improves tolerance to localized distortions and enhances generalization. Second, we introduce a compression-aware strategy during self-supervised learning to increase variability in low-resource scenarios while preserving the integrity of learned representations, thereby improving the suitability of pretrained features for deepfake detection. The framework integrates a learnable self-supervised feature extractor with a ResNet classification head in a unified training pipeline, enabling joint adaptation of acoustic representations and discriminative patterns. On the ASVSpoof5 Challenge (Track 1), the system achieves state-of-the-art results with an Equal Error Rate (EER) of 4.08% under closed conditions, further reduced to 2.71% through fusion of models with diverse pretrained feature extractors. When trained on ASVSpoof2019, our system obtains leading performance on the ASVSpoof2019 evaluation set (0.18% EER) and the ASVSpoof2021 DF task (2.92% EER).
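As a rough illustration of two ideas named in the abstract, the sketch below shows a generic time–frequency mask applied to a spectrogram-like array and a simple threshold-sweep estimate of the Equal Error Rate. It is not the authors' MaskedSpec implementation or the challenge's official scoring tool. The function names, the mask widths, and the convention that higher scores mean "more likely bona fide" are assumptions made for this example.

```python
import numpy as np


def mask_time_freq(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Zero one random frequency band and one random time span.

    `spec` is a 2-D array shaped (freq_bins, frames). The widths are
    hypothetical defaults, not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()  # leave the caller's array untouched
    n_freq, n_time = out.shape

    # Frequency mask: a band of random width at a random position.
    f_width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    out[f_start:f_start + f_width, :] = 0.0

    # Time mask: a span of random width at a random position.
    t_width = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[:, t_start:t_start + t_width] = 0.0
    return out


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumes higher scores mean 'more likely bona fide'.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide, spoof]))

    # False rejection: bona fide scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance: spoof scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is the operating point where the two error rates meet.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)
```

In this sketch the same masking function could in principle be applied to an intermediate feature map shaped (channels, frames), which is one way to read the abstract's latent-space masking, though the paper's exact MaskedFeature procedure is not described here.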
| Original language | English |
|---|---|
| Article number | 101994 |
| Journal | Computer Speech and Language |
| Volume | 101 |
| DOIs | |
| State | Published - Jan 2027 |
Bibliographical note
Publisher Copyright: © 2026. Published by Elsevier Ltd.
Keywords
- ASVSpoof5
- Deepfake speech detection
- Speech augmentations
- Speech features
- Speech processing
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Human-Computer Interaction
Fingerprint
Dive into the research topics of 'Unmasking deepfakes: Leveraging augmentations and features variability for deepfake speech detection'. Together they form a unique fingerprint.