CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Qilang Ye; Zitong Yu; Rui Shao; Xinyu Xie; Philip Torr; Xiaochun Cao

TL;DR

The paper tackles answering questions in dynamic audio-visual scenes, where existing MLLMs often produce ambiguous or vague responses. It introduces CAT, a three-branch architecture with a clue aggregator to harness question-relevant cues, a mixed audio-visual training regimen including AVinstruct data, and AI-assisted Ambiguity-aware Direct Preference Optimization (ADPO) to bias against unclear answers. Empirical results show CAT achieves state-of-the-art or competitive performance across video-based generation, zero-shot video QA, and both closed- and open-ended AVQA tasks, with ablations validating the effectiveness of clues, aggregation, and ADPO. The work provides a practical pathway for more precise, grounded AV reasoning and releases datasets and code to support further development in AVQA research.

Abstract

This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, their responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) Besides straightforwardly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required by large language models. 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct to further enhance the capacity of CAT to model cross-semantic correlations. 3) We propose AI-assisted ambiguity-aware direct preference optimization, a strategy specialized in retraining the model to favor non-ambiguous responses and to improve its ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially on Audio-Visual Question Answering (AVQA) tasks. The code and the collected instructions are released at https://github.com/rikeilong/Bay-CAT.

Paper Structure (14 sections, 8 equations, 6 figures, 8 tables)

This paper contains 14 sections, 8 equations, 6 figures, 8 tables.

Introduction
Related Works
Our Approach
Multimodal Inputs
Aggregating Key Clues
Multimodal Training Strategy
AI-assisted Ambiguity-aware Direct Preference Optimization
Experiments
Datasets
Experimental Setup
Comparison to State-of-the-Art
Ablations and Analyses
Qualitative Analysis
Conclusion

Figures (6)

Figure 1: Comparison between existing MLLMs and CAT. Red words mark incorrect responses, green words mark correct responses, and gray words mark useless responses. Left: Most existing MLLMs straightforwardly bridge multimodal inputs to large language models. CAT builds on this foundation by adding a clue aggregator that learns more detailed knowledge related to the question. Moreover, CAT constrains itself to learn sharper responses through AI-assisted ambiguity-aware direct preference optimization. Right: Compared with the audio-visual-language models Video-LLaMA and ChatBridge, our method accurately recognizes the answers to questions with the most streamlined responses.
Figure 2: Illustration of the proposed CAT and its training strategy. (a) Overview of CAT. CAT first extracts overall audio-visual knowledge from video and audio and transforms it into visual tokens $x^{vid}$ and audio tokens $x^{aud}$. The question, tagged with <Q></Q> in the prompt, is fed into the clue aggregator, which aggregates question-aware audio-visual hidden features and yields clue tokens $x^{cue}$. Finally, the multimodal tokens are merged with the language input and fed into the frozen large language model with LoRA to output the response. (b) The training paradigm of CAT involves pre-alignment of the audio-visual projectors and instruction tuning on the entire model.
Figure 3: Illustration of clue aggregator in a simple example.
Figure 4: Trained-CAT denotes CAT after feature alignment and instruction tuning. Our proposed ADPO strategy involves two steps. First, we collect the negative response generated by trained-CAT and have GPT correct it, based on the original answer, to obtain a positive response. Second, we perform ADPO training to skew CAT toward positive responses and away from negative responses.
Figure 5: The impacts of input modal tokens. Avg. represents the average accuracy of temporal (Te.), consistency (CS.), and detail (De.).
...and 1 more figure
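
The two-step ADPO procedure described in the Figure 4 caption pairs a negative (ambiguous) response with a positive (corrected) one and trains the model to prefer the latter. The exact ADPO loss is not given in this summary, so the following is only a minimal sketch that assumes the standard direct preference optimization objective for a single preference pair. The function names, the log-probability inputs, and the value `beta=0.1` are illustrative assumptions, not taken from the paper or its released code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(policy_logp_pos, policy_logp_neg,
                  ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one (positive, negative) response pair.

    Assumption: ADPO is treated here as plain DPO, where the negative
    response is the ambiguous answer from the trained model and the
    positive response is its corrected version. Inputs are summed
    token log-probabilities of each response under the trainable
    policy and under a frozen reference copy of the model.
    """
    # How much more the policy favors each response than the reference does.
    pos_margin = policy_logp_pos - ref_logp_pos
    neg_margin = policy_logp_neg - ref_logp_neg
    logits = beta * (pos_margin - neg_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


# Hypothetical log-probabilities for one preference pair.
loss_prefers_positive = dpo_pair_loss(-1.0, -5.0, -3.0, -3.0)
loss_prefers_negative = dpo_pair_loss(-5.0, -1.0, -3.0, -3.0)
print(loss_prefers_positive < loss_prefers_negative)
```

Under this assumed objective, the loss shrinks as the policy raises the likelihood of the positive response relative to the reference and lowers that of the negative response, which matches the caption's description of skewing the model toward positive responses.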
