Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness

Yusheng Zhao, Junyu Luo, Xiao Luo, Weizhi Zhang, Zhiping Xiao, Wei Ju, Philip S. Yu, Ming Zhang

TL;DR

The paper presents a four-dimensional framework for evaluating the audio-visual capabilities of multi-modal large language models (MLLMs), covering effectiveness, efficiency, generalizability, and robustness. It benchmarks two state-of-the-art MLLMs, VideoLLaMA 2 and VITA, on Kinetics50 and VGGSound, including corrupted and adversarial variants of these datasets. The results show competitive audio-visual performance but a strong reliance on the visual modality. The models generalize well in zero-shot and few-shot settings, yet they are significantly vulnerable to test-time visual distribution shifts; they are also more robust to adversarial perturbations than traditional baselines. These findings have practical implications for deploying MLLMs in real-world scenarios and point to reducing vision dominance and improving efficiency as directions for future work.

Abstract

Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation of the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve strong performance with limited data. However, their success relies heavily on the vision modality, so performance degrades when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness than traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.

Paper Structure

This paper contains 17 sections, 2 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: The framework of our evaluation of audio-visual capabilities of MLLMs. The MLLM takes audio signals, video frames and textual instructions as inputs and generates the corresponding output.
  • Figure 2: Data efficiency comparison of various models. We compare the models' performance under limited fine-tuning data, and show the results on the Kinetics50 (a) and VGGSound (b) datasets.
  • Figure 3: Visualization of input video frames and audio signals. The clean video frames and audio signals are shown in subfigures (a) and (d), while the corrupted versions are shown in subfigures (b), (c), and (e).
  • Figure 4: An example where the model generates the correct answer. The input video frames and audio signals are shown in subfigures (a) and (b), the textual prompt is shown in subfigure (c) and the model's output is shown in subfigure (d).
  • Figure 5: An example of the model's confusion between speech and textual instructions. We also show the transcript of the audio signals in subfigure (b).
  • ...and 1 more figure