ReWind: Understanding Long Videos with Instructed Learnable Memory

Anxhelo Diko, Tinghuai Wang, Wassim Swaileh, Shiyan Sun, Ioannis Patras

TL;DR

ReWind addresses long-video understanding with a memory-based vision-language framework whose read-perceive-write cycle builds a coherent temporal representation while scaling linearly with the number of tokens. A memory-guided dynamic frame selection (DFS) mechanism then enriches the memory with high-resolution spatial detail for instruction-relevant moments, and the combined representation is passed to an LLM to produce the answer. The approach improves on previous methods on long-video VQA and temporal grounding benchmarks, also performs well on short videos, and keeps memory usage practical for standard GPUs. Overall, ReWind shows that selective memory encoding, combined with targeted spatial enrichment and LLM reasoning, can deliver efficient long-video understanding with strong temporal fidelity.

Abstract

Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13% score gain and a +12% accuracy improvement on the MovieChat-1K VQA dataset and a +8% mIoU increase on Charades-STA for temporal grounding.
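The first-stage read-perceive-write cycle can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names (`cross_attend`, `read_perceive_write`), the random stand-in for learnable queries, the additive instruction conditioning, the projection-free single-head attention, and the append-only write rule are all assumptions made for illustration. It only shows the general pattern of a small set of queries reading from memory, attending to the current sub-clip, and writing a compact summary back.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head cross-attention with no learned projections (a simplification)."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T / scale, axis=-1)
    return weights @ context

def read_perceive_write(clips, instruction, num_queries=4, seed=0):
    """Process sub-clips in order; return memory of shape (len(clips) * num_queries, dim)."""
    dim = instruction.shape[-1]
    rng = np.random.default_rng(seed)
    # Random stand-in for the learnable queries; a trained model would learn these.
    queries = rng.normal(size=(num_queries, dim))
    memory = np.zeros((0, dim))
    for clip_tokens in clips:
        # Read: condition the queries on the instruction and on what is already stored.
        q = queries + instruction
        if len(memory) > 0:
            q = q + cross_attend(q, memory)
        # Perceive: the conditioned queries attend to the current sub-clip's tokens.
        summary = cross_attend(q, clip_tokens)
        # Write: append the compact summary to memory (assumed write rule).
        memory = np.concatenate([memory, summary], axis=0)
    return memory

rng = np.random.default_rng(1)
clips = [rng.normal(size=(16, 8)) for _ in range(3)]  # 3 sub-clips, 16 tokens each
instruction = rng.normal(size=(8,))                   # hypothetical instruction embedding
memory = read_perceive_write(clips, instruction)
print(memory.shape)  # (12, 8): num_queries tokens stored per sub-clip
```

In this sketch each sub-clip of 16 tokens is compressed to 4 memory tokens, so the stored memory grows by a fixed amount per sub-clip rather than with the raw token count. The actual write rule and conditioning used by ReWind may differ.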

Paper Structure

This paper contains 21 sections, 2 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: ReWind is a memory-based VLM framework designed for long video understanding (10+ minutes), specialized in VQA and temporal grounding. The highlighted frames are selected by ReWind's dynamic frame selection mechanism.
  • Figure 2: (a) ReWind's VLM architecture for long video processing, which employs a two-stage processing scheme. In Stage 1 (black arrows), ReWind sequentially processes each video sub-clip using a visual encoder and a text-conditioned perceiver layer supported by a learnable memory module. This module performs read and write operations to ensure efficient information storage and maintain temporal coherence in a novel read-perceive-write cycle. In Stage 2 (green arrows), ReWind utilizes a dynamic frame selection (DFS) mechanism to incorporate detailed spatial information for key moments. Finally (red arrow), the memory content, selected frames, and user instruction are combined to form the input for the language model. (b) The perceiver layer, which uses learnable queries and text-conditioned visual features for instruction-guided encoding.
  • Figure 3: Simplified workflow of ReWind's read-perceive-write cycle.
  • Figure 4: Ablation study on ReWind's performance and memory requirements on the MovieChat-1K test set for different numbers of input frames, ranging from 64 to 1024, at 16-bit precision.
  • Figure 5: Qualitative results on VQA. We give ReWind the illustrated video, which is over 4 minutes long, and ask two types of questions about its content. In the first answer, we showcase ReWind's ability to understand the extended context and highlight in red a hallucination it produces. In the second scenario, we highlight ReWind's ability to focus on different aspects of the video by matching some of the frames selected by DFS with the corresponding details in the generated answer.
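
The Stage 2 idea in the Figure 2 caption, selecting a few key frames guided by the memory content, can be sketched as a simple top-k ranking. This is an illustrative assumption, not the DFS mechanism as implemented in ReWind: the function `select_frames`, the instruction-weighted pooling of memory into a single query vector, and the cosine-similarity scoring are all hypothetical choices.

```python
import numpy as np

def select_frames(frame_feats, memory, instruction, k=2):
    """Return the indices of the k most relevant frames, in temporal order."""
    # Weight memory tokens by their relevance to the instruction (softmax over tokens).
    logits = memory @ instruction / np.sqrt(memory.shape[-1])
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    # Pool the memory into one instruction-aware query vector of shape (dim,).
    query = weights @ memory
    # Score every frame by cosine similarity to the pooled query.
    eps = 1e-8  # guards against division by zero for all-zero features
    norms = np.linalg.norm(frame_feats, axis=1) * np.linalg.norm(query)
    scores = (frame_feats @ query) / (norms + eps)
    # Keep the k highest-scoring frames, then restore temporal order.
    top_k = np.argsort(scores)[-k:]
    return np.sort(top_k)

frame_feats = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0], [2.0, 0.1]])
memory = np.array([[1.0, 0.0], [0.0, 1.0]])   # two toy memory tokens
instruction = np.array([10.0, 0.0])           # hypothetical instruction embedding
print(select_frames(frame_feats, memory, instruction, k=2))  # [1 4]
```

Here the instruction strongly favors the first memory token, so frames 1 and 4, whose features point in nearly the same direction, are selected. Returning indices in temporal order keeps the chosen frames aligned with the video timeline before they are combined with the memory content.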