Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li, Xian Zhang, Yongxin Guo, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi, Luqi Gong, Qiuhong Ke

TL;DR

This work tackles robust multimodal temporal video understanding by introducing TriSense, a triple-modality LLM with a Query-Based Connector that adaptively fuses vision, audio, and speech based on the query. To support learning across diverse modality configurations and long-form content, the authors build TriSense-2M, a 2-million-sample omni-modal dataset with long videos and explicit modality dropout handling. Empirical results show TriSense achieving state-of-the-art performance on segment captioning and moment retrieval across multiple modality configurations, demonstrating strong generalization under missing or noisy inputs. The approach advances practical video analysis by enabling flexible, query-driven multimodal reasoning and temporal alignment, with public code and data planned.

Abstract

Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.
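The abstract does not spell out how the Query-Based Connector computes its weights, so the following is only a rough illustration of the general idea of query-conditioned reweighting, not the authors' architecture. It assumes each modality has already been reduced to a single embedding, scores each available modality by a scaled dot product with the query embedding, and applies a softmax over the modalities that are present. The function name `fuse_modalities`, the dictionary layout, and the scoring rule are all hypothetical.

```python
import math


def fuse_modalities(query, features):
    """Hypothetical sketch: weight each available modality by query similarity.

    query:    list of floats, the query embedding.
    features: dict mapping modality name -> embedding (list of floats),
              or None when that modality is missing (modality dropout).
    Returns (fused_vector, weights), where weights covers present modalities only.
    """
    # Drop missing modalities so they receive no weight at all.
    available = {name: vec for name, vec in features.items() if vec is not None}
    if not available:
        raise ValueError("at least one modality must be present")

    # Scaled dot-product score between the query and each modality embedding.
    scale = math.sqrt(len(query))
    scores = {
        name: sum(q * v for q, v in zip(query, vec)) / scale
        for name, vec in available.items()
    }

    # Numerically stable softmax over the present modalities.
    max_score = max(scores.values())
    exps = {name: math.exp(s - max_score) for name, s in scores.items()}
    total = sum(exps.values())
    weights = {name: e / total for name, e in exps.items()}

    # Weighted sum of the modality embeddings.
    fused = [
        sum(weights[name] * available[name][i] for name in available)
        for i in range(len(query))
    ]
    return fused, weights


# Example: speech is missing, and the query aligns with the vision embedding.
fused, weights = fuse_modalities(
    [1.0, 0.0],
    {"vision": [1.0, 0.0], "audio": [0.0, 1.0], "speech": None},
)
```

Because the softmax runs only over modalities that are present, the weights still sum to one when an input is dropped, which is one simple way to get the flexible combinations of inputs that the abstract describes.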

Paper Structure

This paper contains 25 sections, 6 equations, 12 figures, 10 tables, 2 algorithms.

Figures (12)

  • Figure 1: TriSense supports segment captioning and moment retrieval for videos from audio, visual, and speech modalities, as well as any combination of them, covering a total of eight different tasks.
  • Figure 2: We employ an automated framework to build our dataset by leveraging modality-specific captions from vision, audio, and speech streams. Two large language models (LLMs) are trained for this process: a Generator, which fuses the three input captions into multi-modal outputs (AVS, AV, VS), and a Judger, which evaluates the semantic quality of the generated captions. The Judger assigns an average quality score between 0 and 5 based on alignment with the original inputs. Samples scoring $\geq 3$ are retained, while those scoring $< 3$ are discarded.
  • Figure 3: Video duration distribution. Most videos are 10–20 minutes long (83.5%), supporting long-form temporal understanding.
  • Figure 4: Architecture of the TriSense model. The model processes vision, audio, and speech via dedicated encoders and fuses them using a Query-Based Connector that assigns weights based on the query. The fused output, combined with temporal embeddings, is passed to an LLM for generating timestamped or textual responses.
  • Figure 5: Prompts used for training the Generator and Judger. The left prompt guides GPT in generating omni-modal captions for the Generator using audio, visual, and speech inputs. The right prompt is used to train the Judger by instructing GPT to assess the quality of generated captions based on coverage, accuracy, and paraphrasing. During data creation, samples are randomly selected and manually filtered to ensure high-quality training data.
  • ...and 7 more figures
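
The filtering rule in the Figure 2 caption (keep samples whose average Judger score is at least 3, discard the rest) can be sketched as below. This is a minimal illustration under stated assumptions: the caption does not say what the average is taken over, so the sketch assumes a list of per-criterion scores per sample, and the field name `judger_scores` and the function name `filter_by_judger_score` are hypothetical.

```python
def filter_by_judger_score(samples, threshold=3.0):
    """Hypothetical sketch: keep samples whose mean Judger score >= threshold.

    samples: list of dicts, each with a non-empty "judger_scores" list
             of numbers in the range 0 to 5 (assumed layout).
    """
    kept = []
    for sample in samples:
        scores = sample["judger_scores"]
        average = sum(scores) / len(scores)
        # The caption retains scores >= 3 and discards scores < 3.
        if average >= threshold:
            kept.append(sample)
    return kept


candidates = [
    {"caption": "a", "judger_scores": [4, 3, 5]},  # mean 4.0, kept
    {"caption": "b", "judger_scores": [2, 3, 3]},  # mean below 3, discarded
    {"caption": "c", "judger_scores": [3, 3, 3]},  # mean exactly 3, kept
]
retained = filter_by_judger_score(candidates)
```

The comparison is inclusive, so a sample that scores exactly 3 is retained, matching the caption's boundary.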