Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Umberto Cappellazzo, Stavros Petridis, Maja Pantic

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.

Paper Structure (22 sections, 8 equations, 8 figures, 1 table)

This paper contains 22 sections, 8 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: WER comparison ($\downarrow$) when ablating audio or video modalities across six state-of-the-art AVSR models in clean conditions. Note that the x-axis is displayed on a logarithmic scale.
  • Figure 2: Overview of the three proposed SHAP-based analyses in Dr. SHAP-AV. From the Shapley matrix $\bm{\Phi}$, which captures the contribution of each input feature (rows) to each generated token (columns), we compute: (Left) Global SHAP, which aggregates contributions across all features and tokens to quantify overall modality balance; (Middle) Generative SHAP, which tracks modality contribution dynamics across token generation stages; and (Right) Temporal Alignment SHAP, which examines the correspondence between input feature positions and output token positions.
  • Figure 3: (Left): Global audio/video contributions using Permutation SHAP for six AVSR models under varying acoustic conditions on the LRS3 dataset. (Right): The same analysis using Sampling SHAP.
  • Figure 4: Generative SHAP analysis showing modality contributions as a function of token generation progress (%). Clean and noisy ($-10$ dB) conditions are compared.
  • Figure 5: Temporal Alignment SHAP for AV-HuBERT. Top: audio feature heatmaps under clean (left) and noisy (right) conditions. Bottom: grouped video feature analysis (early/middle/late) under clean (left) and noisy (right) conditions.
  • ...and 3 more figures
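
To make the aggregation described in the Figure 2 caption concrete, the following is a minimal sketch of how a Shapley matrix with input features as rows and generated tokens as columns could be reduced to modality shares. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, audio features are assumed to occupy the first `n_audio` rows, and contributions are assumed to be aggregated by absolute value and normalized to sum to one. The paper's exact normalization is not given in this summary.

```python
import numpy as np


def global_modality_shares(phi, n_audio):
    """Collapse a Shapley matrix into one audio share and one video share.

    phi     : array of shape (n_features, n_tokens); rows are input features,
              columns are generated tokens.
    n_audio : number of leading rows that are audio features (assumption:
              audio rows come first, video rows follow).
    """
    mag = np.abs(np.asarray(phi, dtype=float))  # assumption: use magnitudes
    audio = mag[:n_audio].sum()
    video = mag[n_audio:].sum()
    total = audio + video
    if total == 0:
        raise ValueError("all Shapley values are zero; shares are undefined")
    return audio / total, video / total


def per_token_audio_share(phi, n_audio):
    """Audio share for each generated token (one value per column)."""
    mag = np.abs(np.asarray(phi, dtype=float))
    audio = mag[:n_audio].sum(axis=0)
    total = mag.sum(axis=0)
    if np.any(total == 0):
        raise ValueError("a token column has no attribution mass")
    return audio / total


if __name__ == "__main__":
    # Toy matrix: 2 audio features, 1 video feature, 2 generated tokens.
    phi_demo = np.array([[0.4, -0.2],
                         [0.1, 0.1],
                         [-0.1, 0.1]])
    a, v = global_modality_shares(phi_demo, n_audio=2)
    print(f"audio={a:.2f} video={v:.2f}")  # prints "audio=0.80 video=0.20"
    print(per_token_audio_share(phi_demo, n_audio=2))
```

The per-token variant keeps the column axis, so it can be plotted against generation progress in the spirit of the Generative SHAP analysis. Binning tokens into progress percentages is left out here.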