Ground-Truth Transcriptions of Real Music from Force-Aligned MIDI Syntheses
Robert J. Turetsky
Daniel P. W. Ellis
MetadataShow full item record
Many modern polyphonic music transcription algorithms are presented in a statistical pattern recognition framework. But without a large corpus of real-world music transcribed at the note level, these algorithms are unable to take advantage of supervised learning methods and also have difficulty reporting a quantitative metric of their performance, such as a Note Error Rate. We attempt to remedy this situation by taking advantage of publicly-available MIDI transcriptions. By force-aligning synthesized audio generated from a MIDI transcription with the raw audio of the song it represents we can correlate note events within the MIDI data with the precise time in the raw audio where that note is likely to be expressed. Having these alignments will support the creation of a polyphonic transcription system based on labeled segments of produced music. But because the MIDI transcriptions we find are of variable quality, an integral step in the process is automatically evaluating the integrity of the alignment before using the transcription as part of any training set of labeled examples. Comparing a library of 40 published songs to freely available MIDI files, we were able to align 31 (78%). We are building a collection of over 500 MIDI transcriptions matching songs in our commercial music collection, for a potential total of 35 hours of note-level transcriptions, or some 1.5 million note events.