Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Guangzhi Sun; Wenyi Yu; Changli Tang; Xianzhao Chen; Tian Tan; Wei Li; Lu Lu; Zejun Ma; Chao Zhang

TL;DR

This work tackles the challenge of enabling multimodal LLMs to process audio and visual streams at fine temporal granularity, including speech and non-speech audio events, not just sparse video frames. The authors propose FAVOR, a framework that synchronises audio and visual inputs at frame level, uses a sliding-window strategy, and employs a causal Q-Former with a causal self-attention module to capture temporal causality, aligning the fused representation with LLM embeddings. They introduce AVEB, a comprehensive benchmark combining single-modal and cross-modal tasks to evaluate audio-visual understanding and co-reasoning. Empirical results show FAVOR achieves competitive single-modal performance and large gains on cross-modal tasks, including substantial improvements in Video QA and speech-vision interactions, demonstrating the model's cross-modal cognitive capabilities.
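The frame-level synchronisation and sliding-window steps mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sliding_windows`, the feature dimensions, fusion by concatenation, and the use of non-overlapping windows of `k` frames are all assumptions made for the example.

```python
import numpy as np

def sliding_windows(frames, k):
    """Split a (T, D) array of per-frame features into consecutive
    windows of at most k frames (non-overlapping; an assumption)."""
    if k <= 0:
        raise ValueError("window size k must be positive")
    return [frames[i:i + k] for i in range(0, len(frames), k)]

# Hypothetical per-frame features, assumed to be already time-aligned
# so that row t of each array describes the same video frame.
T = 7
audio = np.zeros((T, 4))   # 4-dim audio feature per frame (made up)
visual = np.ones((T, 6))   # 6-dim visual feature per frame (made up)

# Frame-level fusion by concatenation (one simple choice among many).
fused = np.concatenate([audio, visual], axis=1)

windows = sliding_windows(fused, k=3)
print([w.shape for w in windows])
```

Each window would then be passed to the fusion module separately, so the number of output tokens grows with the input length rather than being fixed.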

Abstract

Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams remains under-explored, even though it is necessary for LLMs to understand general video inputs. To this end, this paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module that better captures the causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed, which comprises six representative single-modal tasks and five cross-modal tasks reflecting audio-visual co-reasoning abilities. While achieving competitive single-modal performance on audio, speech and image tasks in AVEB, FAVOR achieved accuracy improvements of over 20% on the video question-answering task when fine-grained information or temporal causal reasoning is required. In addition, FAVOR demonstrated remarkable video comprehension and reasoning abilities on tasks not previously demonstrated by other multimodal LLMs. An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.

Paper Structure (24 sections, 5 equations, 12 figures, 9 tables)

Introduction
Related Work
Methodology
Model Architecture
Q-Former with Causal Self-Attention
System Training and Diversity Loss
Experimental Setup
Audio-Visual Evaluation Benchmark (AVEB)
Model Configurations
Training Data and Specifications
Experimental Results
Main Results
Ablation Studies
Analysis on the Sliding Window Size
Analysis of the Diversity Loss
...and 9 more sections

Figures (12)

Figure 1: The fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs. The temporal synchronisation module does not contain trainable parameters, and the audio and visual feature encoders are not updated during training.
Figure 2: The causal attention module in the causal Q-Former with a block-wise triangular causal mask (grey cells are masked). The number of features per frame here is 2 as an example.
Figure 3: Influence of the window size and the frames per second (FPS) on model performance on speech and video tasks. (a) and (b): results from training and evaluating with different window sizes $k$ on 10% of the data. (c): the influence of FPS using the best model on the full data.
Figure 4: Variations of model performance due to the diversity loss factor, i.e. $\lambda$ in the diversity loss equation, on (a) AVSR measured in %WER, (b) Video QA measured in %Accuracy and (c) AVSSD measured in %Accuracy. Variations of average cosine similarities are also shown under different values of $\lambda$.
Figure 5: Visualisation of cosine similarity matrix with different diversity loss factors.
...and 7 more figures
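
The block-wise triangular causal mask described in the Figure 2 caption can be illustrated with a short sketch. It assumes that every feature of a frame may attend to all features of the same frame and of earlier frames, but not to later frames; the function name `blockwise_causal_mask` and the convention that `True` means "attention allowed" are choices made for this example, not taken from the paper.

```python
import numpy as np

def blockwise_causal_mask(num_frames, feats_per_frame):
    """Boolean (N, N) matrix with N = num_frames * feats_per_frame.
    Entry [i, j] is True when query feature i may attend to key feature j,
    i.e. when j belongs to the same frame as i or to an earlier frame."""
    # Frame index of every feature position, e.g. [0, 0, 1, 1, 2, 2].
    frame_idx = np.repeat(np.arange(num_frames), feats_per_frame)
    return frame_idx[:, None] >= frame_idx[None, :]

# Two features per frame, as in the Figure 2 example; three frames is arbitrary.
mask = blockwise_causal_mask(num_frames=3, feats_per_frame=2)
print(mask.astype(int))
```

With two features per frame, the allowed region forms 2x2 blocks on and below the diagonal, which is the block-wise counterpart of the usual lower-triangular causal mask.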
