HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad, Vibhav Vineet, Yogesh Singh Rawat

TL;DR

HierarQ tackles the context-length bottleneck of multimodal large language models on long videos with a task-aware, hierarchical querying transformer. A two-stream, language-guided feature modulator separately captures short-term entity details and long-term scene interactions; each stream is backed by a dedicated memory bank, and the banks are updated with FIFO and Memory Bank Compression strategies. The HierarQ module hierarchically fuses entity- and scene-level information and feeds a projected representation to an LLM whose base weights stay frozen and which is adapted with LoRA. This enables auto-regressive, frame-by-frame video understanding without frame sampling. Across 10 benchmarks spanning video understanding, question answering, and captioning, HierarQ achieves state-of-the-art or competitive performance, demonstrating robust long-context video comprehension with efficient memory usage and scalable computation.
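The two memory-update strategies named above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `FifoBank` and `CompressedBank`, the pairing of FIFO with the short-term bank and compression with the long-term bank, and the compression rule (average the most similar pair of temporally adjacent entries) are all assumptions made for illustration, and the 2-D vectors are toy stand-ins for frame features.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FifoBank:
    """Hypothetical short-term bank: keeps only the newest `capacity` entries."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry when a new one arrives.
        self.items = deque(maxlen=capacity)

    def add(self, feat):
        self.items.append(list(feat))


class CompressedBank:
    """Hypothetical long-term bank: when over capacity, merges two entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, feat):
        self.items.append(list(feat))
        if len(self.items) > self.capacity:
            # Assumed rule: find the most similar temporally adjacent pair ...
            sims = [cosine(self.items[i], self.items[i + 1])
                    for i in range(len(self.items) - 1)]
            i = sims.index(max(sims))
            # ... and replace that pair with its element-wise average.
            merged = [(x + y) / 2
                      for x, y in zip(self.items[i], self.items[i + 1])]
            self.items[i:i + 2] = [merged]


# Toy stream of 2-D "frame features" processed one frame at a time.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
short_term, long_term = FifoBank(capacity=3), CompressedBank(capacity=3)
for f in frames:
    short_term.add(f)
    long_term.add(f)
```

Both banks stay at a fixed size however many frames arrive. In this sketch the FIFO bank forgets old frames entirely, while the compressed bank keeps a coarser summary of the whole stream by folding redundant neighbours together.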

Abstract

Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former-based framework that processes frames sequentially, bypassing the need for frame sampling while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by a dedicated memory bank, which enables our proposed Hierarchical Querying transformer (HierarQ) to effectively capture short- and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis.

Paper Structure

This paper contains 25 sections, 9 equations, 16 figures, 16 tables.

Figures (16)

  • Figure 1: Effectiveness of HierarQ in capturing task-relevant information. HierarQ adaptively focuses on task-relevant video segments, achieving a task-aware, comprehensive understanding. Here, color-coded frames are shown to demonstrate how entity-focused information complements the broader prompt-relevant context, enhancing overall video relevance and understanding.
  • Figure 2: Overview of our framework that sequentially processes video frames, modulating task-relevant entity and scene features with a two-stream feature modulator. The proposed HierarQ (Hierarchical Q-Former) with dedicated memory banks integrates these features, producing a refined understanding that is passed to an LLM for the final response. The flame and snowflake icons respectively denote trainable and frozen parameters.
  • Figure 3: Overview of HierarQ (Hierarchical Querying transformer). It models the hierarchical relationship between the Entity-level Q-Former and the Scene-level Q-Former, using dedicated memory banks to integrate short-term details with long-term context for enhanced video understanding.
  • Figure 4: Impact of memory bank length. When one memory length is varied, the other one remains at a fixed size of $10$.
  • Figure 5: Impact of video length. Across both the relation and speak categories of the LVU dataset, MA-LMM's performance decreases as video length increases, whereas HierarQ's performance remains fairly stable, showing the effectiveness of our method on longer videos.
  • ...and 11 more figures