AMPS: ASR with Multimodal Paraphrase Supervision
Abhishek Gupta, Amruta Parulekar, Sameep Chattopadhyay, Preethi Jyothi
TL;DR
This work targets robust ASR for spontaneous multilingual speech by adding paraphrase-based supervision to a multilingual multimodal model (SeamlessM4T). The paraphrase supervision operates on the model's text-to-text pathway and is invoked selectively, only when the ASR loss is high; the resulting auxiliary objective $L_{PAR}$ is combined with the ASR objective to form $L_{AMPS}$. The approach yields relative WER reductions of up to 5% across Hindi, Marathi, Malayalam, Kannada, and Nyanja. These gains are supported by objective metrics and human evaluations, including results on hard-sentence subsets and in low-resource scenarios. The results indicate that multimodal, paraphrase-informed training can improve transcription quality on diverse, real-world speech, with potential extension to atypical or impaired speech and to other low-resource languages.
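The selective combination of the two objectives can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary only says that $L_{PAR}$ is invoked when the ASR loss is high and combined with the ASR objective to form $L_{AMPS}$. The function name `amps_loss`, the fixed `threshold`, the `par_weight` factor, the additive form, and the per-utterance gating are all hypothetical choices made for illustration.

```python
from typing import Sequence


def amps_loss(
    asr_losses: Sequence[float],
    par_losses: Sequence[float],
    threshold: float,
    par_weight: float = 1.0,
) -> float:
    """Batch-mean of a gated sum of ASR and paraphrase losses.

    Assumed form (not confirmed by the source):
        L_AMPS = L_ASR + 1[L_ASR > threshold] * par_weight * L_PAR
    """
    if len(asr_losses) != len(par_losses):
        raise ValueError("asr_losses and par_losses must have the same length")
    if len(asr_losses) == 0:
        raise ValueError("at least one utterance is required")

    total = 0.0
    for l_asr, l_par in zip(asr_losses, par_losses):
        # Invoke the paraphrase objective only for utterances with high ASR loss.
        if l_asr > threshold:
            total += l_asr + par_weight * l_par
        else:
            total += l_asr
    return total / len(asr_losses)


# Illustrative values only: the second utterance exceeds the threshold,
# so only its paraphrase loss contributes to the total.
example = amps_loss([0.5, 3.0], [1.0, 2.0], threshold=2.0)
```

In a real training loop the per-utterance losses would be tensors and the gate would typically be a mask, so that gradients flow through the paraphrase term only for the selected utterances.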
Abstract
Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique, AMPS, that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model, and we selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with a state-of-the-art multimodal model, SeamlessM4T, we obtain significant relative reductions in word error rate (WER) of up to 5%. We present detailed analyses of our system using both objective and human evaluation metrics.
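For readers less familiar with the metric, the sketch below shows how a word error rate and a relative WER reduction are commonly computed. It assumes whitespace tokenization and the standard word-level edit distance; the function names and the example numbers are illustrative only and are not taken from the paper's evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a drop from 20.0 to 19.0 absolute WER
# corresponds to a 5% relative reduction.
illustrative = relative_wer_reduction(20.0, 19.0)
```

Note that a relative reduction is measured against the baseline WER, so a 5% relative reduction is a much smaller change than a 5-point absolute drop.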
