Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models

Christopher Simic, Korbinian Riedhammer, Tobias Bocklet

TL;DR

This work introduces an adapter-based extension of a pre-trained ASR model (Whisper) that enables audio-visual speech recognition without fine-tuning the base model. It uses LoRA adapters grouped into adapter-sets, each tailored to specific noise categories or SNR ranges, together with an AV fusion module and a noise-scenario classifier that selects the appropriate set at inference time. Compared to a full fine-tuning baseline, the approach reaches near state-of-the-art performance with up to 88.5% fewer trainable parameters, which demonstrates the value of scenario-specific adapters. Because the base model stays unchanged, it can still be used for audio-only recognition when visual input is unavailable. The method is validated on LRS3, VoxCeleb2, and MUSAN-derived data, and can be extended with additional adapter-sets for new noise conditions.

Abstract

We present an approach to Audio-Visual Speech Recognition (AVSR) that builds on a pre-trained Whisper model. To infuse visual information into this audio-only model, we extend it with an AV fusion module and LoRA adapters, one of the most recent adapter approaches. One advantage of adapter-based approaches is that only a relatively small number of parameters are trained, while the base model remains unchanged. Common AVSR approaches train a single model to handle several noise categories and noise levels simultaneously. Taking advantage of the lightweight nature of adapters, we instead train noise-scenario-specific adapter-sets, each covering an individual noise category or a specific noise-level range. The most suitable adapter-set is selected by first classifying the noise scenario. This enables our models to achieve good coverage across different noise categories and noise levels while training only a small number of parameters. Compared to a full fine-tuning approach with SOTA performance, our models achieve almost comparable results over the majority of the tested noise categories and noise levels, with up to 88.5% fewer trainable parameters. Our approach can be extended with further noise-specific adapter-sets to cover additional noise scenarios. It is also possible to use the underlying ASR model on its own when no visual information is available, as it remains unchanged.
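The adapter-set idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it models a single frozen linear layer with one low-rank (LoRA-style) update per noise scenario and picks the update from classifier scores. The class `LoRALinear`, the helper `select_scenario`, the scenario names, and all dimensions are hypothetical, and the AV fusion module and the Whisper layers themselves are not modeled.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Frozen weight W plus one low-rank update B @ A per adapter-set.

    Hypothetical illustration only; names and shapes are assumptions.
    """

    def __init__(self, d_in, d_out, rank, scenarios, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Stand-in for a pre-trained weight matrix; never updated.
        self.W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
        self.scale = alpha / rank
        # Common LoRA initialisation: A small random, B zero,
        # so every adapter starts as a no-op on top of W.
        self.adapters = {
            name: {
                "A": [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)],
                "B": [[0.0] * rank for _ in range(d_out)],
            }
            for name in scenarios
        }

    def forward(self, x, scenario=None):
        y = matvec(self.W, x)
        if scenario is None:
            # No adapter selected: plain frozen base layer.
            return y
        ad = self.adapters[scenario]
        delta = matvec(ad["B"], matvec(ad["A"], x))
        return [yi + self.scale * di for yi, di in zip(y, delta)]

    def trainable_params(self):
        """Parameters trained for ONE adapter-set (A and B only)."""
        one_set = next(iter(self.adapters.values()))
        return sum(len(r) for r in one_set["A"]) + sum(len(r) for r in one_set["B"])


def select_scenario(class_scores):
    """Return the scenario with the highest classifier score."""
    return max(class_scores, key=class_scores.get)
```

In this toy setting, calling `forward(x)` without a scenario corresponds to using the unchanged base model, while `forward(x, select_scenario(scores))` applies the adapter-set chosen for the classified noise scenario. Supporting a new noise condition only requires adding another entry to `adapters`.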

Paper Structure

This paper contains 17 sections, 2 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overall model architecture. Left: the AVSR model, comprising the frozen, pre-trained ASR model (gray), the selected set of LoRA adapters (orange), and the AV fusion module (green). Right: the noise-scenario classifier (blue), which selects the most suitable adapter-set.