
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

TL;DR

AVTrustBench targets the reliability and robustness of audio-visual LLMs (AVLLMs) by evaluating them along three dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. The authors construct a 600K-sample benchmark spanning 9 tasks and assess 13 state-of-the-art AVLLMs, revealing major gaps in trustworthiness. They introduce CAVPref, a model-agnostic calibrated preference optimization strategy with a robustness module, which yields gains of up to 30.19% across tasks and reduces modality biases. The authors state that they will publicly release the code and benchmark to guide future development of reliable AVLLMs in real-world scenarios.

Abstract

With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks primarily assess the visual aspect and do not examine holistic audio-visual (AV) understanding. Moreover, there are currently no benchmarks that investigate the ability of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations of existing approaches, we further propose CAVPref, a robust, model-agnostic training strategy based on calibrated audio-visual preference optimization, obtaining a gain of up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.

Paper Structure (43 sections, 16 equations, 19 figures, 15 tables)

Figures (19)

  • Figure 1: Introducing AVTrustBench and CAVPref. We present AVTrustBench, a new benchmark comprising three challenging yet unexplored axes, i.e., Adversarial Attack, Compositional Reasoning, and Modality Dependency, and evaluate SOTA Audio-Visual LLMs (AVLLMs) on this benchmark. We observe that these models perform poorly under these settings. To alleviate these limitations, we propose a novel AVLLM-agnostic preference optimization strategy, CAVPref, which substantially improves the reliability and robustness of these models over existing solutions such as DPO. The legend marker in the figure denotes the VideoLLaMA2 model.
  • Figure 2: AVTrustBench statistics and AVLLMs leaderboard. (Left) Task-wise data distribution. Our benchmark comprises 9 diverse tasks spanning 3 dimensions. (Right) Performance comparison on AVTrustBench. Values represent dimension-wise averages.
  • Figure 3: Task definitions. AVTrustBench comprises a total of 9 tasks: MCIT, ICIT, MVIT, and MAIT from Adversarial attack; COT-Stitch, COT-Swap, and CAT from Compositional reasoning; and MAT and MVT from Modality-specific dependency. The goal of each dimension is to critically assess the robustness of existing AVLLMs under different modes of challenges. In each case, the AVLLMs are presented with a multiple-choice question setup. Refer to the task taxonomy section for task-specific details.
  • Figure 4: Qualitative results. We report the performance of the top 8 models on three representative tasks: MCIT, COT-Swap, and MAT. GPT-4o consistently outperforms open-source models. Under the instruction setting, we append the phrase "If the correct answer is not present respond with None of the above". More qualitative results can be found in the supplementary material.
  • Figure 5: Overview of CAVPref. We formulate a distributionally robust AV preferential optimization objective to incorporate the multi-modal relationships across different modalities and counter the tailing effect across diverse categories in the dataset.
  • ...and 14 more figures
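
The Figure 2 caption reports dimension-wise averages, and the Figure 3 caption lists which tasks belong to which dimension. A minimal sketch of that aggregation follows. It assumes an unweighted mean over the tasks in each dimension (the captions do not say whether averaging is per task or per sample), and the function name `dimension_averages` and the example scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

# Task-to-dimension grouping as listed in the Figure 3 caption.
DIMENSIONS = {
    "Adversarial attack": ["MCIT", "ICIT", "MVIT", "MAIT"],
    "Compositional reasoning": ["COT-Stitch", "COT-Swap", "CAT"],
    "Modality-specific dependency": ["MAT", "MVT"],
}


def dimension_averages(task_scores):
    """Return the unweighted mean score per dimension.

    Assumption: every task counts equally within its dimension.
    """
    missing = [
        task
        for tasks in DIMENSIONS.values()
        for task in tasks
        if task not in task_scores
    ]
    if missing:
        raise KeyError(f"missing scores for tasks: {missing}")
    return {
        dim: mean(task_scores[task] for task in tasks)
        for dim, tasks in DIMENSIONS.items()
    }


# Placeholder values for illustration only; NOT numbers from the paper.
example_scores = {
    "MCIT": 40.0, "ICIT": 20.0, "MVIT": 30.0, "MAIT": 30.0,
    "COT-Stitch": 50.0, "COT-Swap": 40.0, "CAT": 60.0,
    "MAT": 10.0, "MVT": 30.0,
}
averages = dimension_averages(example_scores)
```

If the benchmark instead weights tasks by sample count, the `mean` call would need to be replaced by a weighted average using the per-task sample sizes.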