Child Speech Recognition as Low Resource Automatic Speech Recognition

Date
2020-05-12
Publisher
Johns Hopkins University
Abstract
This thesis investigates child speech recognition as a low-resource scenario of automatic speech recognition (ASR), and explores multiple methods to improve the performance of both hybrid and end-to-end ASR models in recognizing children's speech. Like ASR for adults, child speech recognition aims to automatically transcribe the content of audio recordings into text. Because of differences in vocal characteristics, ASR models trained only on adult speech are inadequate for recognizing child speech. With few publicly available child speech corpora, recognizing child speech calls for more data-efficient methods of developing ASR systems. In this thesis, three strategies widely used in low-resource ASR are investigated for child speech recognition:

- Using compact model parameterization: factorized time delay neural networks (TDNN-F) are used as more data-efficient acoustic models (AM) for deep neural network (DNN)-HMM hybrid ASR models;
- Adapting models trained on out-of-domain data: transfer learning is used to adapt an end-to-end ASR model trained on adult speech for child speech recognition;
- Making creative use of available in-domain data: different data augmentation methods are applied to enhance existing child speech data for training hybrid ASR models.

Empirical results are presented on several publicly available data sets and compared with previously published results on the same data sets.
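The abstract does not name the specific augmentation methods used; as one illustration, a common choice in low-resource ASR pipelines is speed perturbation, which resamples each training utterance at a few factors (e.g. 0.9 and 1.1) to create extra copies. The sketch below is a minimal, dependency-light version using linear interpolation; the factor values, function name, and use of `numpy.interp` are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def speed_perturb(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform to change its speed (and pitch) by `factor`.

    factor > 1.0 speeds the audio up (fewer output samples);
    factor < 1.0 slows it down. Linear interpolation keeps this
    sketch dependency-free; production pipelines use a proper
    band-limited resampler instead.
    """
    n_out = int(round(len(samples) / factor))
    # Fractional positions in the original signal for each output sample.
    positions = np.linspace(0, len(samples) - 1, num=n_out)
    return np.interp(positions, np.arange(len(samples)), samples)

# A 1-second 440 Hz tone at 8 kHz; perturbing at 0.9x and 1.1x yields
# two extra training copies, tripling the effective amount of audio.
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
slow = speed_perturb(tone, 0.9)  # longer signal, lower pitch
fast = speed_perturb(tone, 1.1)  # shorter signal, higher pitch
```

Each perturbed copy is treated as an independent training utterance, which is one way the "creative use of available in-domain data" strategy stretches a small child-speech corpus.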
Keywords
Automatic Speech Recognition (ASR), Child speech recognition