Large Language Models are Strong Audio-Visual Speech Recognition Learners

Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic

TL;DR

Llama-AVSR demonstrates that a decoder-only multimodal LLM can perform ASR, VSR, and AVSR by keeping large pre-trained audio and video encoders frozen and connecting them to the LLM through lightweight modality-specific projectors and LoRA modules. A compression rate $K$ downsamples the encoder outputs to limit the number of tokens passed to the LLM, and the system relies on pre-trained encoders (Whisper, AV-HuBERT) and a pre-trained LLM (Llama3.1-8B). On LRS3 (with VoxCeleb2 augmentation), it achieves state-of-the-art ASR and AVSR WERs while updating only tens of millions of parameters. Ablations show that the choice of encoder and LLM, the use of LoRA, and the compression rate each have a marked effect on performance. The work points to a scalable, modular way of using powerful LLMs for audio-visual speech recognition, which is relevant in noisy environments and resource-constrained settings.
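
To make the LoRA idea mentioned above concrete, here is a minimal NumPy sketch of a linear layer whose pre-trained weight stays frozen while a low-rank update is learned. The layer sizes, rank, scaling factor, and the name `lora_linear` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha):
    """Frozen weight W plus a low-rank update scaled by alpha / r."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4               # hypothetical sizes, not the paper's
W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init: update starts at 0

x = rng.normal(size=(3, d_in))
y = lora_linear(x, W, A, B, alpha=8.0)

trainable = A.size + B.size              # only A and B would receive gradients
frozen = W.size
print(f"trainable={trainable}, frozen={frozen}")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small matrices add to the trainable parameter count.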

Abstract

Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic) speech recognition (ASR) abilities by just concatenating the audio tokens, computed with an audio encoder, and the text tokens to achieve state-of-the-art results. On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. It leverages pre-trained audio and video encoders to produce modality-specific tokens which, together with the text tokens, are processed by a pre-trained LLM (e.g., Llama3.1-8B) to yield the resulting response in an auto-regressive fashion. Llama-AVSR requires a small number of trainable parameters as only modality-specific projectors and LoRA modules are trained whereas the multi-modal encoders and LLM are kept frozen. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively. To bolster our results, we investigate the key factors that underpin the effectiveness of Llama-AVSR: the choice of the pre-trained encoders and LLM, the efficient integration of LoRA modules, and the optimal performance-efficiency trade-off obtained via modality-aware compression rates.
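
The token pipeline described in the abstract can be sketched as follows. This is a toy NumPy illustration under stated assumptions: random arrays stand in for the frozen encoder outputs, compression is done by stacking $K$ consecutive frames along the feature axis, each projector is a two-layer MLP, and the concatenation order is arbitrary. All dimensions and helper names (`compress`, `project`, `make_projector`) are hypothetical.

```python
import numpy as np

def compress(features, k):
    """Stack every k consecutive frames: (T, D) -> (T // k, k * D)."""
    t, d = features.shape
    t_trim = (t // k) * k                # drop trailing frames that do not fill a group
    return features[:t_trim].reshape(t_trim // k, k * d)

def make_projector(rng, d_in, d_llm):
    """Random weights for a two-layer projector (assumed design)."""
    return (rng.normal(size=(d_in, d_llm)) * 0.02, np.zeros(d_llm),
            rng.normal(size=(d_llm, d_llm)) * 0.02, np.zeros(d_llm))

def project(features, params):
    """Linear -> ReLU -> linear, mapping features into the LLM embedding space."""
    w1, b1, w2, b2 = params
    return np.maximum(features @ w1 + b1, 0.0) @ w2 + b2

rng = np.random.default_rng(0)
d_llm = 32                               # toy LLM embedding size
audio_feats = rng.normal(size=(100, 16)) # stand-in for frozen audio encoder output
video_feats = rng.normal(size=(50, 12))  # stand-in for frozen video encoder output
text_embeds = rng.normal(size=(7, d_llm))  # stand-in for embedded text tokens

k_audio, k_video = 4, 2                  # modality-specific compression rates
audio_tokens = project(compress(audio_feats, k_audio),
                       make_projector(rng, 16 * k_audio, d_llm))
video_tokens = project(compress(video_feats, k_video),
                       make_projector(rng, 12 * k_video, d_llm))

# Concatenate along the sequence axis; the ordering here is an assumption.
llm_input = np.concatenate([audio_tokens, video_tokens, text_embeds], axis=0)
print(llm_input.shape)
```

In this sketch only the projector weights would be trained; the encoder outputs and the LLM consuming `llm_input` are treated as fixed.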

Paper Structure

This paper contains 7 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of Llama-AVSR. Audio and video features are extracted via pre-trained encoders and subsequently downsampled and projected into the LLM space through modality-specific projectors. The resulting audio and video tokens are concatenated with the textual ones and processed by the LLM. The video encoder is equipped with the LoRA module only for the VSR task (dashed outline). Two icons in the figure (lost in extraction) mark whether a block is trained or kept frozen, respectively.
  • Figure 2: WERs of multiple Llama-AVSR configurations for the ASR task.
  • Figure 3: WER trend as a function of $K$ for the ASR and VSR tasks.
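
As a rough illustration of why the compression rate $K$ matters for efficiency, the short calculation below counts how many tokens reach the LLM for different values of $K$. The frame rates (50 features per second for audio, 25 per second for video) and the 6-second clip length are assumptions chosen for the example, and `tokens_after_compression` is a hypothetical helper.

```python
def tokens_after_compression(duration_s, frame_rate_hz, k):
    """Number of tokens left after grouping every k encoder frames into one."""
    n_frames = round(duration_s * frame_rate_hz)
    return n_frames // k

AUDIO_RATE_HZ = 50   # assumed audio encoder output rate
VIDEO_RATE_HZ = 25   # assumed video encoder output rate
clip_seconds = 6.0   # hypothetical utterance length

for k in (1, 2, 3, 4, 5):
    audio = tokens_after_compression(clip_seconds, AUDIO_RATE_HZ, k)
    video = tokens_after_compression(clip_seconds, VIDEO_RATE_HZ, k)
    print(f"K={k}: audio tokens={audio}, video tokens={video}")
```

Larger $K$ shortens the sequence the LLM must process, which is the efficiency side of the trade-off that Figure 3 sets against WER.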