Robust speaker verification systems based on deep neural networks

Author:
  1. Gómez Alanís, Alejandro
Supervised by:
  1. Antonio Miguel Peinado Herreros (Supervisor)
  2. José Andrés González López (Co-supervisor)

Defended at: Universidad de Granada

Defense date: 21 January 2022

Examination committee:
  1. M. Carmen Benítez Ortuzar (Chair)
  2. José Luis Pérez Córdoba (Secretary)
  3. Massimiliano Todisco (Member)
  4. Alfonso Ortega Giménez (Member)
  5. Héctor Delgado Flores (Member)
Department:
  1. Electronics and Computer Technology

Type: Thesis

Abstract

In a world that is becoming increasingly digital, the need for robust authentication methods enabling secure access to resources and systems is crucial. Early identity management systems relied on cryptographic methods that required users to remember a password, carry a card, or even combine both to prove their identity. In contrast to these authentication methods, a more natural alternative for human identification/verification is one based on physiological (fingerprint, face, iris, etc.) or behavioral (voice, gait, signature, etc.) attributes of individuals, known as biometrics. This Thesis focuses on voice biometric systems for human verification, where the speech signal is employed to make a one-to-one comparison between the user's voice and the enrolled voice of the claimed identity stored in the database. The main goal of this Thesis is the development of robust automatic speaker verification (ASV) systems able to detect the two main types of biometric attacks: (i) zero-effort attacks, where a non-enrolled speaker utters bonafide speech in order to try to gain access as an enrolled speaker; and (ii) spoofing attacks, where an impostor tries to gain fraudulent access by presenting speech resembling the voice of a genuine enrolled speaker. The vulnerability of ASV systems to malicious spoofing attacks is a serious concern nowadays, since an impostor can easily present a pre-recorded voice of an enrolled user (replay spoofing attack), generate artificial speech resembling the voice of an enrolled user (text-to-speech spoofing attack), or transform the voice recording of a given speaker so that it sounds like that of an enrolled speaker without changing the phonetic content of the recording (voice conversion spoofing attack). To make voice biometric systems more robust to this type of attack, this Thesis proposes the following contributions.

First, we have dealt with the problem of spoofing attack detection for voice biometric systems. The main problem here is the lack of robustness and generalization across different databases. We addressed this issue by proposing a novel neural network architecture which can be used for detecting both logical access and physical access spoofing attacks. The proposed convolutional RNN-based architecture is able to process the whole input utterance without cropping it or applying any post-processing combination of chunks. Moreover, since noisy acoustic scenarios can significantly degrade the performance of anti-spoofing systems, we have also proposed two noise-aware techniques based on the use of masks which effectively reduce this performance degradation. Our best-performing technique involves the computation and use of signal-to-noise masks that inform the DNN-based spoofing embedding extractor of the noise probability for each time-frequency bin of the input speech spectrogram.

Second, we have proposed new loss functions which can be effectively used by anti-spoofing systems and by integrated ASV and anti-spoofing systems. In particular, we have proposed a new probabilistic loss function for supervised metric learning, where every training class is represented by a probability density function estimated through kernel density estimation over all the samples of the mini-batch. We argue that each class is thereby represented more accurately than in other popular loss functions. Moreover, the proposed loss function replaces the concept of distance between embeddings used in negative hard-mining techniques with the probability that an embedding belongs to a given class. This has the advantage of avoiding both the selection of an appropriate distance measure and the tuning of extra hyper-parameters such as distance margins.
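As a rough illustration of this idea (a minimal sketch under our own assumptions, not the exact formulation developed in the Thesis), the following PyTorch snippet represents each class present in a mini-batch with a Gaussian-kernel density estimate built from that class's embeddings, and trains every embedding to have a high posterior probability under its own class density. The `bandwidth` value and the leave-one-out normalization are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def kde_metric_loss(embeddings, labels, bandwidth=1.0):
    """Sketch of a KDE-based probabilistic metric-learning loss.

    Each class in the mini-batch is represented by a Gaussian-kernel density
    estimate over its embeddings; every embedding is penalized by the negative
    log posterior probability of belonging to its own class.
    """
    # Pairwise squared Euclidean distances between all batch embeddings (B, B).
    d2 = torch.cdist(embeddings, embeddings).pow(2)
    # Gaussian kernel matrix; zero the diagonal for a leave-one-out estimate.
    k = torch.exp(-d2 / (2.0 * bandwidth ** 2))
    k = k * (1.0 - torch.eye(len(labels), device=k.device))

    classes = labels.unique()  # distinct class labels in the batch, shape (C,)
    # KDE value of every embedding under every class density -> (B, C).
    dens = torch.stack(
        [(k * (labels == c).float().unsqueeze(0)).sum(dim=1)
         / (labels == c).float().sum().clamp(min=1.0)
         for c in classes],
        dim=1)

    # Posterior probability that an embedding belongs to each class.
    post = dens / dens.sum(dim=1, keepdim=True).clamp(min=1e-12)
    # Map raw labels to column indices of `post`.
    target = (labels.unsqueeze(1) == classes.unsqueeze(0)).long().argmax(dim=1)
    return F.nll_loss(torch.log(post.clamp(min=1e-12)), target)
```

In this sketch the only free hyper-parameter is the kernel bandwidth; there is no distance margin or hard-negative selection rule to tune, which mirrors the advantage attributed above to the proposed loss.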
Furthermore, we also propose a new loss function for integration systems based on the expected performance and spoofability curve (EPSC), which allows the voice biometric system to be optimized over the whole operating range in which it is expected to work during evaluation, instead of over a single operating point. These proposals significantly improve the performance of both anti-spoofing systems and complete voice biometric systems.

Third, we have studied the integration of ASV and anti-spoofing systems at the score level and at the embedding level. To avoid combining scores computed separately by the two systems, we proposed a new neural network architecture that integrates them at the embedding level and exploits the fact that ASV and anti-spoofing systems share the bonafide speech subspace. Thus, the proposed integration system is able to model the three main biometric speech subspaces: bonafide speech, zero-effort attacks and spoofing attacks. Experimental results on the ASVspoof 2019 corpus show that the joint processing of the ASV and anti-spoofing embeddings with the proposed integration neural network clearly outperforms other state-of-the-art techniques trained and evaluated under the same conditions.

Finally, we have studied the robustness of state-of-the-art voice biometric systems in the presence of adversarial spoofing attacks. We have also proposed a new DNN-based generator network for this type of attack, which is trained using existing spoofing attacks and can be used to fine-tune the biometric system in order to make it more robust to adversarial spoofing attacks. Experimental results show that voice biometric systems are highly sensitive to adversarial spoofing attacks in both logical and physical access scenarios. Moreover, the proposed ABTN generator clearly outperforms classical adversarial attack techniques such as the fast gradient sign method (FGSM) and projected gradient descent (PGD), both of which are sketched below.

To conclude, we would like to highlight that our contributions successfully integrate signal processing and deep learning methods for developing robust voice biometric systems. As a result, our proposed biometric system obtained one of the best single-system results in terms of equal error rate (EER) and minimum tandem detection cost function (min-tDCF) in the ASVspoof 2019 Challenge for both logical and physical access attack scenarios.
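For reference, the two classical adversarial baselines mentioned above can be summarized with the following generic PyTorch sketch, applied to an arbitrary differentiable model and loss. The `model`, `loss_fn`, `eps`, `alpha` and `steps` names and values are placeholders chosen for illustration only; the thesis's trained ABTN generator is not reproduced here.

```python
import torch

def fgsm_attack(model, x, y, loss_fn, eps=2e-3):
    """Fast gradient sign method: a single signed-gradient step on the input."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def pgd_attack(model, x, y, loss_fn, eps=2e-3, alpha=5e-4, steps=10):
    """Projected gradient descent: iterated signed-gradient steps, projected
    back into an L-infinity ball of radius `eps` around the clean input."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, min=-eps, max=eps)  # eps-ball projection
    return x_adv.detach()
```

Against a differentiable anti-spoofing or ASV model, both attacks perturb the input (e.g., a waveform or spectrogram) so as to increase the chosen loss while keeping the perturbation small.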