$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo

TL;DR

AVRobustBench introduces a rigorous benchmark to assess test-time robustness of audio-visual recognition models under co-occurring, correlated shifts across audio and video. It covers four AV datasets with 75 corruptions and defines $\

Abstract

While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time is still not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, online test-time adaptation (TTA) methods offer only minimal performance improvements under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach that enables on-the-fly cross-modal fusion by penalizing high-entropy samples and achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available at https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark.

Paper Structure

This paper contains 29 sections, 5 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: $\textsc{AVRobustBench}$ comprises diverse and correlated audio-visual corruptions that co-occur in the real world.
  • Figure 2: Corruption severity has a large effect on model robustness; increasing severity decreases robustness. We illustrate $\rho$ with varying severity on $\textsc{AudioSet-2C}$ (top) and $\textsc{EpicKitchens-2C}$ (bottom). For $\textsc{AudioSet-2C}$, we show the performance of CAV-MAE, AudioCLIP, and ImageBind. For $\textsc{EpicKitchens-2C}$, we report $\rho$ for TBN (Noun), TBN (Verb), TIM (Noun), and TIM (Verb). The x-axis denotes corruption severity, and the y-axis denotes $\rho$. More examples, including $\textsc{Kinetics-2C}$, are in the Appendix.
  • Figure 3: ImageBind’s [girdhar2023imagebind] "emergent" zero-shot generalization remains ineffective even with "context"-aware prompts. We show the relative accuracy change (%), i.e., $(\mathcal{A}_{cl} - \mathcal{A}_{i,s})/\mathcal{A}_{cl}$ with $s=5$, on $\textsc{VGGSound-2C}$, using different prompts for the text encoder: Prompt 1: "a noisy audio of $<$CLS$>$.", Prompt 2: "a noisy photo of $<$CLS$>$.", and Prompt 3: "a noisy photo of $<$CLS$>$ and a noisy audio of $<$CLS$>$".
  • Figure 4: Over time steps ($t$) during online TTA, an attention imbalance in the form of modality bias emerges under AV corruptions, degrading the performance of READ. Average attention weights are computed across 12 heads from one block of CAV-MAE's joint encoder for a batch size of 64. The numbers indicate averaged attention, scaled by 10,000. We show the Gaussian corruption on $\textsc{VGGSound-2C}$ for discussion.
  • Figure 5: State-of-the-art audio-visual segmentation models still struggle in the presence of bimodal audio and visual corruptions. We use SAMA-AVS [liu2024annotation] to run inference directly on the AVSBench-S4 [zhou2022audio] test set with our proposed corruptions (severity 5). Each task includes 740 videos, and we report the absolute drops in mean intersection over union (mIoU) and F-score relative to the clean AVSBench-S4 results of SAMA-AVS.
  • ...and 6 more figures
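
The relative accuracy change quoted in the Figure 3 caption, $(\mathcal{A}_{cl} - \mathcal{A}_{i,s})/\mathcal{A}_{cl}$, can be illustrated with a minimal sketch. The function name `relative_accuracy_change` and all accuracy values below are hypothetical placeholders chosen for illustration; they are not taken from the paper or its code release.

```python
def relative_accuracy_change(clean_acc: float, corrupted_acc: float) -> float:
    """Return 100 * (A_cl - A_is) / A_cl, the relative accuracy change in percent.

    clean_acc     -- accuracy on clean data (A_cl)
    corrupted_acc -- accuracy under corruption i at severity s (A_is)
    """
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    return 100.0 * (clean_acc - corrupted_acc) / clean_acc


# Hypothetical accuracies in percent (NOT results from the paper).
clean = 50.0
corrupted = {"prompt_1": 25.0, "prompt_2": 20.0, "prompt_3": 30.0}

# One relative change per prompt; larger values mean a larger drop from clean.
changes = {
    name: relative_accuracy_change(clean, acc)
    for name, acc in corrupted.items()
}

for name, change in changes.items():
    print(f"{name}: {change:.1f}% relative accuracy change")
```

Under this definition, a value of 0 means no degradation relative to clean accuracy and a value of 100 means accuracy has fallen to zero, which makes drops comparable across models with different clean baselines.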