WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang

TL;DR

WAVE is the first LLM-based embedding to create a unified representation space for text, audio, and video. It significantly outperforms existing embedding models in multimodal question answering.

Abstract

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code and checkpoints are released at https://github.com/TCL606/WAVE.

Paper Structure

This paper contains 26 sections, 5 equations, 2 figures, and 10 tables.

Figures (2)

  • Figure 1: Inputs can be text-only, vision-only, audio-only, or audio–visual. For text-only cases, the final embeddings are obtained via last-token pooling over the LLM’s last hidden states. For multimodal inputs, the last output tokens from all LLM layers are concatenated and passed to a feature-fusion module to produce a unified multimodal embedding. Note that text prompts are always provided to instruct the LLM for multimodal inputs.
  • Figure 2: A heatmap visualizing the cosine similarity between video embeddings (V1-V4) and text embeddings (T1-T4). All four video embeddings are generated from the same video but conditioned on different textual prompts. The text embeddings represent various concepts present in the video.
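
The two captions above describe a pooling step and a similarity computation, which can be sketched on toy data as follows. This is a minimal, dependency-free illustration and not the released implementation: the function names, the toy dimensions, the random stand-in hidden states, and the single linear projection used as the fusion module are all assumptions made for the example, since this excerpt does not specify the actual fusion architecture.

```python
import math
import random


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden vector of the last non-padded token.

    hidden_states: list of seq_len vectors; attention_mask: list of 0/1 flags.
    """
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(hidden_states[last_index])


def concat_last_tokens(per_layer_hidden, attention_mask):
    """Concatenate the last-token vector taken from every layer."""
    concatenated = []
    for layer_hidden in per_layer_hidden:
        concatenated.extend(last_token_pool(layer_hidden, attention_mask))
    return concatenated


def linear_fusion(features, weight):
    """Hypothetical fusion module: one linear projection (an assumption)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(rows, cols):
    """Pairwise cosine similarities, one row per entry of `rows`."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
num_layers, seq_len, hidden_dim, embed_dim = 3, 5, 4, 4
mask = [1, 1, 1, 1, 0]  # the final position is padding


def random_hidden():
    """Stand-in for per-layer LLM hidden states: [layer][token][dim]."""
    return [
        [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(seq_len)]
        for _ in range(num_layers)
    ]


fusion_weight = [
    [rng.gauss(0, 1) for _ in range(num_layers * hidden_dim)]
    for _ in range(embed_dim)
]

# Text-only inputs: last-token pooling over the final layer's hidden states.
text_embeddings = [last_token_pool(random_hidden()[-1], mask) for _ in range(4)]

# Multimodal inputs: last tokens from all layers, concatenated, then fused.
video_embeddings = [
    linear_fusion(concat_last_tokens(random_hidden(), mask), fusion_weight)
    for _ in range(4)
]

# 4 x 4 grid analogous to the V1-V4 versus T1-T4 heatmap in Figure 2.
heatmap = similarity_matrix(video_embeddings, text_embeddings)
```

With real model outputs, each row of `heatmap` would correspond to one prompt-conditioned video embedding and each column to one text concept. With the random stand-ins used here the values carry no meaning beyond showing the shape of the computation.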