Graduate School and Research Center in Digital Sciences

Discriminate natural versus loudspeaker emitted speech

Le, Thanh-Ha; Gilberton, Philippe; Duong, Ngoc Q. K

ICASSP 2019, International Conference on Acoustics, Speech, and Signal Processing, 12-17 May 2019, Brighton, UK

In this work, we address a novel and potentially emerging problem: discriminating natural human voices from those played back by any kind of audio device, in the context of interactions with in-home voice user interfaces. The problem has relevant applications in (1) far-field voice interaction with vocal interfaces such as Amazon Echo, Google Home, Facebook Portal, etc., and (2) replay spoofing attack detection. In the first application, detecting loudspeaker-emitted speech helps avoid false wake-ups and unintended interactions with the devices; in the second, it helps eliminate attacks that involve replaying recordings collected from enrolled speakers. We first collect a real-world dataset under well-controlled conditions containing two classes: recordings of speech spoken directly by numerous people (considered natural speech), and recordings of speech played back from various loudspeakers (considered loudspeaker-emitted speech). From this dataset, we then build prediction models based on Deep Neural Networks (DNNs), for which different combinations of audio features are considered. Experimental results confirm the feasibility of the task: the combination of audio embeddings extracted from the SoundNet and VGGish networks yields a classification accuracy of up to about 90%.
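The feature combination described in the abstract can be illustrated with a minimal sketch, assuming simple feature-level fusion (concatenation) of the two embeddings followed by a small feedforward classifier. The abstract does not specify the fusion method, the network architecture, or the embedding sizes, so everything below is an assumption for illustration: the function names are hypothetical, the 1024-dimensional SoundNet size is a placeholder, the embeddings are synthetic random vectors, and the classifier is untrained.

```python
import numpy as np

# Assumed embedding sizes (not stated in the abstract): 128 is the size of
# the publicly released VGGish embedding; 1024 is a placeholder for
# whichever SoundNet layer would be used.
VGGISH_DIM = 128
SOUNDNET_DIM = 1024


def fuse_embeddings(soundnet_emb, vggish_emb):
    """Feature-level fusion: concatenate the two embeddings of each clip."""
    return np.concatenate([soundnet_emb, vggish_emb], axis=1)


def init_classifier(input_dim, hidden_dim=64, seed=0):
    """Randomly initialise a one-hidden-layer binary classifier (untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / input_dim), (input_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, np.sqrt(1.0 / hidden_dim), (hidden_dim, 1)),
        "b2": np.zeros(1),
    }


def predict_proba(params, features):
    """Return one score in (0, 1) per row, read as P(loudspeaker-emitted)."""
    hidden = np.maximum(0.0, features @ params["W1"] + params["b1"])  # ReLU
    logits = hidden @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))  # sigmoid


# Synthetic stand-ins for the embeddings of 4 audio clips (not real data).
rng = np.random.default_rng(1)
soundnet_emb = rng.normal(size=(4, SOUNDNET_DIM))
vggish_emb = rng.normal(size=(4, VGGISH_DIM))

features = fuse_embeddings(soundnet_emb, vggish_emb)
params = init_classifier(features.shape[1])
probs = predict_proba(params, features)
```

In practice the weights would be learned from the two labelled classes (natural versus loudspeaker-emitted speech); the sketch only shows how two embedding streams can feed a single classifier.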


Title: Discriminate natural versus loudspeaker emitted speech
Keywords: Natural versus device emitted speech classification, voice interaction, audio features, deep neural network (DNN)
Department: Digital Security
Eurecom ref: 5801
Copyright: © 2019 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Bibtex:
@inproceedings{EURECOM+5801,
  doi = {},
  year = {2019},
  title = {{D}iscriminate natural versus loudspeaker emitted speech},
  author = {{L}e, {T}hanh-{H}a and {G}ilberton, {P}hilippe and {D}uong, {N}goc {Q}. {K}},
  booktitle = {{ICASSP} 2019, {I}nternational {C}onference on {A}coustics, {S}peech, and {S}ignal {P}rocessing, 12-17 {M}ay 2019, {B}righton, {UK}},
  address = {{B}righton, {UNITED} {KINGDOM}},
  month = {05},
  url = {}
}