Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

TL;DR

Meerkat tackles fine-grained audio-visual grounding by unifying spatial and temporal understanding within a single large language model framework. It introduces AVOpT, an optimal-transport-based weak alignment mechanism, and AVACE, an attention-consistency module, to jointly fuse image and audio representations for precise grounding. The authors also provide AVFIT, a dataset of 3M instruction-tuning samples, and MeerkatBench, a benchmark suite that unifies five audio-visual tasks for multi-task evaluation. The model achieves state-of-the-art performance on all five tasks, with a relative improvement of up to 37.12%, and ablations support the design choices. This work paves the way for scalable, fine-grained AV reasoning with LLMs and releases datasets and benchmarks to support reproducibility and further research.

Abstract

Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.

Paper Structure

This paper contains 9 sections, 4 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: We present Meerkat, an audio-visual LLM that can effectively ground both spatially and temporally in image and audio. Our model is adept at tasks that require fine-grained understanding, such as Audio Referred Image Grounding, Image Guided (IG) Audio Temporal Localization, and Audio-Visual (AV) Fact-checking. It can also be extended to perform coarse-grained tasks like AVQA and AV Captioning.
  • Figure 2: Overview of Meerkat. Our model is equipped with fine-grained audio-visual comprehension abilities. Given image-audio pairs (I, A), the Audio-Visual Optimal Transport alignment (AVOpT) module B learns the patch-wise image-audio association, facilitating weak alignment between the two modalities by minimizing the patch-level Wasserstein distance. Subsequently, the Audio-Visual Attention Consistency Enforcement (AVACE) module A maximizes the region-level alignment by confining the cross-modal attention maps around the objects of interest and minimizing the association with the background. After the text instruction T is tokenized, the modality-specific latents ($\tilde{z}_{I}, \tilde{z}_{A}, z_{T}$) are passed to the instruction-tuned Llama 2 model, which serves as a unified interface for the downstream tasks. We employ LoRA-based fine-tuning of the LLM.
  • Figure 3: Qualitative results. We compare our method against its closest baselines on all downstream tasks. Aided by our novel design and instruction-tuning datasets, Meerkat outperforms prior approaches on spatio-temporal grounding as well as coarse-grained tasks.
  • Figure 4: cIoU upper bound on VGG-SS for Full vs. LoRA based finetuning.
  • Figure 5: Task-wise dataset distribution. The bi-coloured cells denote collections of paired image-audio samples from public datasets following our data curation strategy while single-coloured cells signify direct adaptation. Datasets with dashed outlines are used only during model training while the ones with are reserved for zero-shot evaluations. Other datasets have a defined train/test split. Numbers in the bottom right represent the total #samples present in each task.
  • ...and 1 more figure
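
The Figure 2 caption describes AVOpT as learning a patch-wise image-audio association by minimizing a patch-level Wasserstein distance. As a rough illustration of that idea, the sketch below computes an entropy-regularized optimal transport plan between two sets of patch embeddings using Sinkhorn iterations, which only approximates the Wasserstein distance. It is not the authors' implementation: the cosine cost, the uniform patch weights, the regularization strength `eps`, the embedding shapes, and every function and variable name are assumptions made for this example.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between histograms a and b.

    cost: (n, m) cost matrix; a: (n,) and b: (m,) non-negative weights summing to 1.
    eps and n_iters are illustrative defaults, not values from the paper.
    """
    kernel = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # rescale rows toward marginal a
        v = b / (kernel.T @ u)    # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


# Hypothetical stand-ins for image and audio patch embeddings.
rng = np.random.default_rng(0)
img_patches = rng.normal(size=(16, 32))  # 16 image patches, 32-dim
aud_patches = rng.normal(size=(8, 32))   # 8 audio patches, 32-dim

cost = cosine_cost(img_patches, aud_patches)
a = np.full(16, 1.0 / 16)  # uniform mass over image patches (assumption)
b = np.full(8, 1.0 / 8)    # uniform mass over audio patches (assumption)

plan = sinkhorn_plan(cost, a, b)
ot_distance = float((plan * cost).sum())  # transport cost under the plan
```

Here `plan[i, j]` can be read as a soft association between image patch `i` and audio patch `j`, and `ot_distance` is the scalar that an alignment objective of this kind would drive down during training. In a trainable model the same computation would be written in an autodiff framework so that gradients reach the encoders.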