SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data

Yichen Lu, Jiaqi Song, Xuankai Chang, Hengwei Bian, Soumi Maiti, Shinji Watanabe

TL;DR

SynesLM addresses the challenge of unified audio-visual language understanding by building a decoder-only Transformer that fuses discrete speech representations with full-frame visual information. It introduces a synthetic image data recovery pipeline to strengthen audio-visual correlations and enables multitask capabilities across AV-ASR, VST, and VMT using modality and language tokens. The approach achieves competitive AV-ASR performance and significant BLEU gains in translation tasks, including a zero-shot AV-ASR state-of-the-art improvement on VisSpeech, and demonstrates robust cross-modal fusion. These findings suggest practical potential for integrated multimodal speech and language processing in real-world settings.

Abstract

In this work, we present SynesLM, a unified model that performs three multimodal language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and visual-aided speech/machine translation (VST/VMT). Unlike previous research that focused on lip motion as the visual cue for speech signals, our work explores more general visual information within entire frames, such as objects and actions. Additionally, we use synthetic image data to enhance the correlation between image and speech data. We benchmark SynesLM on the How2 dataset, demonstrating performance on par with state-of-the-art (SOTA) models dedicated to AV-ASR while maintaining our multitasking framework. Remarkably, for zero-shot AV-ASR, SynesLM achieves SOTA performance, lowering the Word Error Rate (WER) from 43.4% to 39.4% on the VisSpeech dataset. Furthermore, our VST and VMT results outperform previous results, improving the BLEU score from 37.2 to 43.5 for VST and from 54.4 to 54.8 for VMT.
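
To put the reported numbers in context, the following is a minimal sketch of how WER is conventionally computed (word-level edit distance divided by the number of reference words), along with the relative and absolute changes implied by the figures in the abstract. The function names and the example sentences are illustrative assumptions, not taken from the paper, and the scoring tool and text normalization the authors used are not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of an error metric relative to its baseline."""
    return (baseline - new) / baseline


# Hypothetical sentence pair (not from How2 or VisSpeech):
# one substitution plus one deletion over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Figures quoted in the abstract.
wer_relative = relative_reduction(43.4, 39.4)  # zero-shot AV-ASR on VisSpeech
vst_bleu_gain = 43.5 - 37.2                    # absolute BLEU gain, VST
vmt_bleu_gain = 54.8 - 54.4                    # absolute BLEU gain, VMT

print(f"example WER: {example_wer:.3f}")
print(f"relative WER reduction: {100 * wer_relative:.1f}%")
print(f"BLEU gains: VST +{vst_bleu_gain:.1f}, VMT +{vmt_bleu_gain:.1f}")
```

Read this way, the VisSpeech result is a 4.0-point absolute drop in WER (roughly 9% relative), and the VST gain of 6.3 BLEU is much larger than the VMT gain of 0.4 BLEU.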

Paper Structure

This paper contains 11 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An overview of the SynesLM architecture. The special tokens are defined at the end of the method section.
  • Figure 2: Synthetic Data Recovery Pipeline.
  • Figure 3: Qualitative examples on How2 ASR. The audio with original visual (A+OV) and audio with synthetic visual (A+SV) methods successfully extract and understand the information in the image and incorporate it with the speech representation to perform the ASR task.