Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding

Sosuke Yamao, Natsuki Miyahara, Yuankai Qi, Shun Takeuchi

Abstract

In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead establish a feedback-driven process in which past visual contexts stored in the context memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA), which learns to preserve visual information related to the given question from both the current clip and the past related frames from the memory. The compressor and memory feedback work iteratively for each clip of the entire video. This simple yet effective design yields large performance gains on long-term video understanding tasks. Extensive experiments show that our method achieves significant improvement over current state-of-the-art methods by 6.1% on MLVU test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long. The code will be released publicly.

Paper Structure (41 sections, 10 equations, 8 figures, 14 tables)

Figures (8)

  • Figure 1: (a) Conventional frameworks [Li2023LLaMAVIDAI, He2024MALMMML, zhang2025llavamini, yamao2024iqvic] compress each frame independently without feedback from memory to perception, resulting in limited temporal reasoning and contextual inconsistency. (b) The proposed Question-guided Visual Compression with Memory Feedback (QViC-MF) framework performs multi-frame compression guided by both the question and recalled memory, enabling feedback-driven perception that preserves temporal event completeness.
  • Figure 2: (a) Overview of the proposed Question-guided Visual Compression with Memory Feedback (QViC-MF) framework (Section \ref{subsec_overall_framework}). The video is processed sequentially in clip-wise steps indexed by $n$. The visual encoder and projector extract visual embeddings from the current clip $\mathcal{V}_{\mathrm{c},n}$ and recalled frames $\mathcal{V}_{\mathrm{r},n}$ from the context memory $\mathcal{B}_{n-1}$, which are fed into a visual compressor equipped with Question-guided Multimodal Selective Attention (QMSA). The compressor transforms the context seed embeddings $\mathcal{E}_{\mathrm{c}_0}$ into context embeddings $\mathcal{E}_{\mathrm{c},n}$ by selectively compressing visual features conditioned on the question. The context memory $\mathcal{B}_n$ stores the context embeddings with their frame indices $\mathcal{S}_n$ and relevance scores $\mathcal{R}_n$, incrementally appending new entries and pruning low-relevance ones to form the updated memory. From this memory, the top-$K_{\mathrm{r}}$ most relevant frame indices are retrieved to form the recalled frames $\mathcal{V}_{\mathrm{r},n+1}$ used for the next clip. Finally, the LLM decoder generates the text answer based on the updated memory. Trainable components are marked with flame icons. (b) Illustration of QMSA (Section \ref{subsec_qmsa}). QMSA regulates multimodal attention through Mask, Block, and Guide operations, enabling frame-wise compression while preserving cross-frame context and focusing on question-related features. Here, Ctx2Vis and Ctx2Txt denote context-to-visual and context-to-text attention, respectively.
  • Figure 3: (a) Without our blocking context-to-text (Ctx2Txt), i.e., when using a naive variant of our visual compressor with standard causal masking in the self-attention module, textual information from the question may leak into the context embeddings, causing compression hallucination, where the embeddings no longer represent pure visual features. (b) Impact of guiding context-to-visual (Ctx2Vis). Without guiding Ctx2Vis, the model exhibits question-insensitive attention, failing to adapt its focus to the question. Incorporating guiding Ctx2Vis enables question-adaptive attention, allowing the model to attend selectively to regions relevant to the queried content.
  • Figure 4: Comparison of the proposed visual compressor equipped with QMSA against single-frame and average pooling baseline compressors. "Single-frame" denotes frame-by-frame compression without temporal integration. Each point shows the accuracy keep rate (in %) with the actual score in parentheses. The horizontal axis represents the compression rate, which is the ratio of visual tokens after compression to those before. To isolate the effect of the visual compressor, context memory and memory feedback are disabled, and all videos are uniformly sampled to 64 frames.
  • Figure 5: Case study example from MLVU, illustrating how QViC-MF utilizes context memory to answer long-term reasoning questions. The example corresponds to an Action Order (AO) task question, where the model predicts the correct sequence of events over a long-term video. Input frames corresponding to stored context memory entries are shown, and answer-critical frames are highlighted in red. The relevance score plot indicates the relevance score of each context memory entry over time.
  • ...and 3 more figures
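
The memory update and recall loop described in the Figure 2(a) caption can be illustrated with a minimal sketch. This is not the authors' implementation: the names `MemoryEntry` and `ContextMemory`, the fixed `capacity` used as the pruning rule, and the choice to return recalled indices in temporal order are all assumptions made for illustration. The caption only states that entries are appended, low-relevance entries are pruned, and the top-$K_{\mathrm{r}}$ most relevant frame indices are retrieved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryEntry:
    """One context memory entry (hypothetical structure)."""
    frame_index: int        # an element of S_n
    relevance: float        # an element of R_n
    embedding: List[float]  # placeholder for the frame's context embedding


@dataclass
class ContextMemory:
    """Toy stand-in for the context memory B_n."""
    capacity: int  # assumed pruning budget; not specified in the caption
    entries: List[MemoryEntry] = field(default_factory=list)

    def update(self, new_entries: List[MemoryEntry]) -> None:
        """Append new entries, then prune the lowest-relevance ones."""
        self.entries.extend(new_entries)
        if len(self.entries) > self.capacity:
            by_relevance = sorted(
                self.entries, key=lambda e: e.relevance, reverse=True
            )
            kept = by_relevance[: self.capacity]
            # Restore temporal order after pruning (an assumption).
            self.entries = sorted(kept, key=lambda e: e.frame_index)

    def recall(self, k_r: int) -> List[int]:
        """Return the frame indices of the top-k_r most relevant entries."""
        top = sorted(self.entries, key=lambda e: e.relevance, reverse=True)[:k_r]
        return sorted(e.frame_index for e in top)


# Two clip-wise steps with made-up relevance scores.
memory = ContextMemory(capacity=3)
memory.update([MemoryEntry(0, 0.2, []), MemoryEntry(1, 0.9, [])])
memory.update([MemoryEntry(2, 0.5, []), MemoryEntry(3, 0.7, [])])
recalled = memory.recall(k_r=2)  # frames fed back for the next clip: [1, 3]
```

In this toy run, frame 0 is pruned once the memory exceeds its assumed capacity, and frames 1 and 3 are recalled because they carry the two highest relevance scores. In the actual framework the recalled frames would be re-encoded and passed to the compressor together with the next clip.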