Robust Speaker Recognition using Perceptual and Adversarial Speech Enhancement
Johns Hopkins University
In Automatic Speaker Verification (ASV), we determine whether the speaker in a test utterance is the same as a previously enrolled speaker. Deep learning has significantly improved ASV performance; however, it remains susceptible to external disturbances and domain mismatches. A standard solution is data augmentation, i.e., adding noise and reverberation to the training data. We focus on developing pre-processing solutions that can be integrated with existing pipelines and advance empirical performance on state-of-the-art systems. To this end, we pursue deep learning-based speech enhancement and develop solutions equipped with denoising, domain adaptation, and bandwidth extension (BWE). Existing speech enhancement solutions often degrade ASV performance, partly due to loss of speaker information. We propose using perceptual/deep features that leverage pre-trained models to address this. Through ablation studies, we also demonstrate the effectiveness of our denoiser by showing that it compensates for noise classes missing from the x-vector training data augmentation. We further improve the training data for telephony speaker verification, where it is common practice to downsample higher-bandwidth microphone speech to a lower sampling frequency and apply telephone codecs. We propose to replace this practice with a mapping learned by a deep feature-based CycleGAN. This novel technique improves the training data and complements prior techniques, including data augmentation. To handle bandwidth mismatch, we pursue BWE with time-domain architectures. We develop competitive Generative Adversarial Networks (GANs): supervised (conditional GAN) and unsupervised (CycleGAN). Our findings indicate that unsupervised learning can approach supervised performance. We also pursue BWE schemes jointly trained with domain adaptation. Finally, with our proposed Self-FiLM scheme, we leverage self-supervised representations to better guide BWE models in unseen environments.
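To make the Self-FiLM idea concrete, the following is a minimal NumPy sketch of the underlying FiLM (Feature-wise Linear Modulation) mechanism, in which each feature channel of a BWE model is scaled and shifted by parameters predicted from a conditioning embedding (here standing in for a self-supervised representation). All shapes, variable names, and the linear predictors are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def film(features, gamma, beta):
    """FiLM: scale and shift each feature channel using
    conditioning-derived parameters (broadcast over time)."""
    return gamma * features + beta

rng = np.random.default_rng(0)

# Hypothetical shapes: a (channels, time) feature map inside a BWE model,
# conditioned on an 8-dim embedding (e.g., a self-supervised vector).
x = rng.standard_normal((4, 10))
cond = rng.standard_normal(8)

# Hypothetical linear layers predicting per-channel gamma and beta.
W_gamma, W_beta = rng.standard_normal((2, 4, 8))
gamma = W_gamma @ cond   # shape (4,): per-channel scale
beta = W_beta @ cond     # shape (4,): per-channel shift

y = film(x, gamma[:, None], beta[:, None])  # modulated features, shape (4, 10)
```

With gamma fixed to 1 and beta to 0, FiLM reduces to the identity, so the conditioning pathway can learn to intervene only where the embedding carries useful information.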
In conclusion, we provide evidence that speech enhancement can serve as a pre-processor for improving ASV. By testing on real data from the Speaker Recognition Evaluation challenges, we demonstrate the effectiveness of speaker-identity-preserving denoisers. We also study the effectiveness of domain adaptation and self-supervised learning for improving bandwidth extension. Our work opens avenues for jointly investigating enhancement-related problems and for developing better generative models to assist x-vector systems.
Keywords: Speech Enhancement, Speaker Recognition, Generative Adversarial Networks