• Login
    View Item 
    •   JScholarship Home
    • Theses and Dissertations, Electronic (ETDs)
    • ETD -- Doctoral Dissertations
    • View Item
    •   JScholarship Home
    • Theses and Dissertations, Electronic (ETDs)
    • ETD -- Doctoral Dissertations
    • View Item
    JavaScript is disabled for your browser. Some features of this site may not work without it.

    WAKE WORD DETECTION AND ITS APPLICATIONS

    Thumbnail
    View/Open
    WANG-DISSERTATION-2021.pdf (2.027Mb)
    Date
    2021-07-13
    Author
    Wang, Yiming
    0000-0002-5588-8241
    Metadata
    Show full item record
    Abstract
    Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. Novel methods are proposed to train a wake word detection system from partially labeled training data, and to use it in on-line applications. In the system, the prerequisite of frame-level alignment is removed, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word. Also, an FST-based decoder is presented to perform online detection. The suite of methods greatly improve the wake word detection performance across several datasets. A novel neural network for acoustic modeling in wake word detection is also investigated. Specifically, the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection is explored, including looking-ahead to the next chunk, gradient stopping, different positional embedding methods and adding same-layer dependency between chunks. Experiments demonstrate that the proposed Transformer model outperforms the baseline convolutional network significantly with a comparable model size, while still maintaining linear complexity w.r.t. the input length. For the application of the detected wake word in ASR, the problem of improving speech recognition with the help of the detected wake word is investigated. Voice-controlled house-hold devices face the difficulty of performing speech recognition of device-directed speech in the presence of interfering background speech. Two end-to-end models are proposed to tackle this problem with information extracted from the anchored segment. The anchored segment refers to the wake word segment of the audio stream, which contains valuable speaker information that can be used to suppress interfering speech and background noise. A multi-task learning setup is also explored where the ideal mask, obtained from a data synthesis procedure, is used to guide the model training. In addition, a way to synthesize "noisy" speech from "clean" speech is also proposed to mitigate the mismatch between training and test data. The proposed methods show large word error reduction for Amazon Alexa live data with interfering background speech, without sacrificing the performance on clean speech.
    URI
    http://jhir.library.jhu.edu/handle/1774.2/64380
    Collections
    • ETD -- Doctoral Dissertations

    DSpace software copyright © 2002-2016  DuraSpace
    Policies | Contact Us | Send Feedback
    Theme by 
    Atmire NV
     

     

    Browse

    All of JScholarshipCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

    My Account

    LoginRegister

    DSpace software copyright © 2002-2016  DuraSpace
    Policies | Contact Us | Send Feedback
    Theme by 
    Atmire NV