Johns Hopkins University
Voice-enabled interfaces for human-machine interaction have made significant progress in recent years. Much of this success can be attributed to deep neural networks trained on thousands of hours of transcribed data. However, such vast amounts of labeled data are unavailable for most of the world's spoken languages, e.g., regional languages. Here we explore alternative techniques that learn directly from data with no or minimal manual transcription. The representations used to characterize the speech signal dictate the performance of unsupervised systems. Self-supervised methods such as Contrastive Predictive Coding (CPC) have emerged as a promising approach to representation learning from unlabeled speech data. Based on the observation that acoustic information, e.g., phones, changes more slowly than CPC's feature extraction rate, we propose regularization techniques that impose slowness constraints on the features. First, we propose two regularization techniques: a self-expressing constraint and a Left-or-Right regularization. Our modifications outperform the baseline CPC in monolingual, cross-lingual, and multilingual settings on the ABX and linear phone classification benchmarks. However, CPC and our modifications capture the audio signal's structure mainly at the frame level, whereas speech is structured beyond the frame level, i.e., at the phone level or higher. We therefore propose a segmental contrastive predictive coding (SCPC) framework that learns from signal structure at both the frame and phone levels. SCPC is a hierarchical model with three stages trained in an end-to-end manner. In the first stage, the model predicts future feature frames and extracts frame-level representations from the raw waveform. In the second stage, a differentiable boundary detector finds variable-length segments. In the last stage, the model predicts future segments to learn segment representations.
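The contrastive objective underlying both baseline CPC and SCPC scores a predicted future frame against negatives drawn from other time steps, and a slowness constraint penalizes fast frame-to-frame change. A minimal NumPy sketch (the function names and the simple squared-difference penalty are illustrative assumptions, not the thesis's exact formulation):

```python
import numpy as np

def infonce_loss(context, positive, negatives):
    """CPC-style contrastive (InfoNCE) loss for one prediction step.
    context:   (d,) context vector c_t from the autoregressive model
    positive:  (d,) encoding z_{t+k} of the true future frame
    negatives: (n, d) encodings of distractor frames from other times
    """
    pos = context @ positive            # score of the true future frame
    neg = negatives @ context           # scores of the distractors
    logits = np.concatenate(([pos], neg))
    # cross-entropy with the positive in slot 0
    return float(np.log(np.sum(np.exp(logits))) - pos)

def slowness_penalty(z):
    """Generic slowness regularizer: mean squared frame-to-frame change.
    z: (T, d) frame-level representations."""
    return float(np.mean((z[1:] - z[:-1]) ** 2))
```

The penalty is zero for a perfectly constant feature sequence and grows with the rate of change, which is what a slowness constraint on representations is meant to encourage.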
Experiments show that our model outperforms existing phone and word segmentation methods on the TIMIT and Buckeye datasets. In the last part, we explore unsupervised knowledge distillation from text encoders (e.g., RoBERTa) to speech encoders in a multimodal setting. Text encoders operate at the sub-word level, while speech encoders operate at a much finer scale, i.e., frames. Our segmental framework allows us to downsample frames and generate sub-word-like units. SCPC thus enables us to leverage pretrained text encoders in audio-visual and audio-only settings. We show significant performance improvements on audio-image retrieval and semantic similarity tasks.
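One common way to realize such cross-modal distillation is to align the segment-level speech embeddings with the text encoder's sub-word embeddings via a cosine-distance objective. A sketch under that assumption (the thesis's exact distillation loss may differ):

```python
import numpy as np

def distill_loss(speech_segs, text_embs):
    """Cosine-distance distillation between paired embeddings.
    speech_segs: (n, d) segment-level speech embeddings (student)
    text_embs:   (n, d) text-encoder embeddings (teacher), row-aligned
    Returns the mean of (1 - cosine similarity) over the n pairs.
    """
    s = speech_segs / np.linalg.norm(speech_segs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

The loss is zero when student and teacher embeddings point in the same directions and approaches two when they are anti-aligned, so minimizing it pulls the speech encoder toward the frozen text encoder's representation space.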
Keywords: zero-resource speech processing, self-supervised learning, unsupervised phone segmentation, unsupervised word segmentation, multimodal learning