DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen, An-Zi Yen

TL;DR

DaMO introduces a data-efficient, temporally aware video-language framework that fuses audio-visual information through a Temporal-aware Fuseformer and uses a global residual to reduce compute while preserving global context. A four-stage progressive training paradigm (video-text alignment, representation bridging, temporal perception learning, and dialogue tuning) enables strong temporal reasoning with limited data, aided by LLM-based augmentation of temporal QA datasets. Empirical results across zero-shot retrieval, temporal grounding, and temporally grounded dialogue demonstrate state-of-the-art performance in precise moment localization and multimodal reasoning, with notable data efficiency. The work also provides publicly releasable LLM-augmented temporal QA datasets to facilitate future research in data-efficient temporal reasoning for video-language models.

Abstract

Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.

Paper Structure

This paper contains 32 sections, 3 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Qualitative comparison on temporal reasoning in video-grounded QA. Given a temporal question grounded in a video clip, DaMO generates a more precise and temporally aligned response than Video-LLaMA (Zhang et al., 2023) and VTimeLLM (Huang et al., 2024), showcasing superior temporal understanding.
  • Figure 2: Overview of DaMO. Visual and audio features extracted by pretrained encoders undergo dimensionality reduction via a global residual, with grouped convolutions further compressing visual features along the temporal dimension. Before multimodal fusion, Temporal Embeddings are explicitly added to the modality-specific features. The Temporal-aware Fuseformer refines and integrates the multimodal temporal representations, which the Q-Former then projects into the embedding space of the LoRA-adapted LLM. The LLM is prompted with the concatenation of these embeddings and the user query for temporal reasoning.
  • Figure 3: Architecture of T-Fuseformer. Each layer consists of unimodal attention and multimodal attention. Unimodal features are first refined via self-attention and then compressed via cross-attention with learnable queries. FUSION queries are introduced to attend to the compressed visual and audio features through self-attention and integrate multimodal information. Stacked layers progressively enhance temporal and cross-modal representations. The final FUSION queries serve as the temporally grounded representation passed to the LLM.
  • Figure 4: Multi-turn Visual Understanding in Dialogue.
  • Figure 5: Temporal Localization and Reasoning.
  • ...and 3 more figures
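
The data flow described in the captions for Figures 2 and 3 can be sketched in a few lines of NumPy. This is only an illustrative reading of the captions, not the authors' implementation: the function names (`fuseformer_layer`, `run_sketch`), the token counts, the sinusoidal form of the temporal embeddings, the residual connections, and the choice of what is carried from one layer to the next are all assumptions. Learned projection weights are omitted, and random arrays stand in for encoder features and learnable queries.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention; q is (Nq, d), k and v are (Nk, d)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def temporal_embedding(length, dim):
    """Sinusoidal table (assumed form; the caption only says 'Temporal Embeddings')."""
    pos = np.arange(length)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    table = np.zeros((length, dim))
    table[:, 0::2] = np.sin(pos * freq)
    table[:, 1::2] = np.cos(pos * freq)
    return table


def fuseformer_layer(vis, aud, vis_q, aud_q, fusion_q):
    """One hypothetical layer: unimodal attention followed by multimodal attention."""
    # Unimodal attention: refine each stream with self-attention (residual is an assumption).
    vis = vis + attention(vis, vis, vis)
    aud = aud + attention(aud, aud, aud)
    # Compress each stream: a small set of queries cross-attends to the refined features.
    vis_q = vis_q + attention(vis_q, vis, vis)
    aud_q = aud_q + attention(aud_q, aud, aud)
    # Multimodal attention: FUSION queries share one self-attention pass with the
    # compressed tokens, and only the FUSION positions are kept afterwards.
    joint = np.concatenate([fusion_q, vis_q, aud_q], axis=0)
    joint = joint + attention(joint, joint, joint)
    return vis, aud, vis_q, aud_q, joint[: len(fusion_q)]


def run_sketch(num_layers=2, dim=32, seed=0):
    """Stack layers on random stand-in features and return the final FUSION queries."""
    rng = np.random.default_rng(seed)
    # Temporal embeddings are added to each modality before fusion.
    vis = rng.normal(size=(16, dim)) + temporal_embedding(16, dim)
    aud = rng.normal(size=(24, dim)) + temporal_embedding(24, dim)
    vis_q = rng.normal(size=(4, dim))      # stand-in for learnable visual queries
    aud_q = rng.normal(size=(4, dim))      # stand-in for learnable audio queries
    fusion_q = rng.normal(size=(8, dim))   # stand-in for learnable FUSION queries
    for _ in range(num_layers):
        vis, aud, vis_q, aud_q, fusion_q = fuseformer_layer(vis, aud, vis_q, aud_q, fusion_q)
    return fusion_q


fused = run_sketch()
print(fused.shape)  # (8, 32)
```

In this reading, the output length depends only on the number of FUSION queries, not on the number of video or audio tokens, which is one way a fixed-size, temporally grounded representation could be handed to the Q-Former and the LLM.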