End-to-end modeling for speech spoofing and deepfake detection

Tak, Hemlata

Voice biometric systems are being used in various applications for secure user authentication using automatic speaker verification technology. However, these systems are vulnerable to spoofing attacks, which have become even more challenging with recent advances in artificial intelligence algorithms. There is hence a need for more robust and efficient detection techniques. This thesis proposes novel detection algorithms which are designed to perform reliably in the face of the highest-quality attacks. The first contribution is a non-linear ensemble of sub-band classifiers, each of which uses a Gaussian mixture model. Competitive results show that models which learn sub-band specific discriminative information can substantially outperform models trained on full-band signals. Given that deep neural networks are more powerful and can perform both feature extraction and classification, the second contribution is a RawNet2 model. It is an end-to-end (E2E) model which learns features directly from the raw waveform. The third contribution includes the first use of graph neural networks (GNNs) with an attention mechanism to model the complex relationship between spoofing cues present in the spectral and temporal domains. We propose an E2E spectro-temporal graph attention network called RawGAT-ST. The RawGAT-ST model is further extended to an integrated spectro-temporal graph attention network, named AASIST, which exploits the relationship between heterogeneous spectral and temporal graphs. Finally, this thesis proposes a novel data augmentation technique called RawBoost and uses a self-supervised, pre-trained speech model as a front-end to improve generalisation to in-the-wild conditions.

Digital Security
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Thesis.

PERMALINK : https://www.eurecom.fr/publication/7273