ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura

TL;DR

ReMoRa tackles long-form video understanding by moving from dense RGB frames to compressed-domain representations, addressing the $O(n^2)$ complexity of self-attention on long sequences. It retains a small set of I-frames for appearance and encodes temporal dynamics with motion vectors, which the Refined Motion Representation (RMR) module refines into dense, fine-grained motion cues. A Hierarchical Motion State Space (HMSS) module models temporal dependencies in linear time $O(n)$ by exploiting the codec's GOP structure, enabling scalable long-video reasoning. The method reports state-of-the-art results on long-video benchmarks such as LongVideoBench, NExT-QA, and MLVU, and competitive performance on VideoMME and Perception Test, with efficient compute. This work supports practical deployment of video-language models for long narratives.
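To make the $O(n^2)$ versus $O(n)$ contrast concrete, the following is a minimal sketch, not the HMSS module itself. It assumes a generic scalar state space recurrence $h_t = a\,h_{t-1} + b\,x_t$, $y_t = c\,h_t$; the function names and the parameters `a`, `b`, `c` are illustrative and do not come from the paper.

```python
def attention_pair_count(n):
    """Number of query-key pairs that full self-attention scores for n tokens."""
    return n * n


def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Generic scalar state space recurrence (illustrative, not HMSS).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    The loop visits each input once, so the work grows linearly with len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # update the hidden state from the previous state
        ys.append(c * h)   # read out the current state
    return ys


if __name__ == "__main__":
    # An impulse at t=0 decays geometrically through the hidden state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
    # Doubling the sequence length quadruples the attention pair count.
    print(attention_pair_count(1000), attention_pair_count(2000))
```

The recurrence carries a fixed-size state forward, which is why its cost scales with sequence length rather than with the number of token pairs.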

Abstract

While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motion, we introduce a module that denoises it and generates a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
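As an illustration of why block-based motion is coarse, the sketch below expands a grid of per-block motion vectors into a per-pixel field by plain replication. This is an assumed, simplified stand-in and not the paper's refinement module: the function name `upsample_block_motion`, the `(dx, dy)` tuple layout, and the fixed square block size are all hypothetical.

```python
def upsample_block_motion(block_mvs, block_size):
    """Replicate each block's (dx, dy) vector over a block_size x block_size patch.

    block_mvs: list of rows, each row a list of (dx, dy) tuples, one per block.
    Returns a per-pixel field as a list of rows of (dx, dy) tuples.
    """
    dense = []
    for row in block_mvs:
        dense_row = []
        for mv in row:
            # Every pixel inside a block shares that block's single vector.
            dense_row.extend([mv] * block_size)
        for _ in range(block_size):
            dense.append(list(dense_row))
    return dense


if __name__ == "__main__":
    coarse = [[(1, 0), (0, 2)]]  # one row of two blocks
    for pixel_row in upsample_block_motion(coarse, block_size=2):
        print(pixel_row)
```

The resulting field is piecewise constant, with sharp jumps at block boundaries. A learned refinement step such as the one the abstract describes would aim to replace this with a cleaner, finer-grained field.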

Paper Structure (42 sections, 10 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Overview of ReMoRa. Our method utilizes compressed video representations, which naturally separate each video into keyframes and compressed inter-frame redundancies. From these, we extract motions that are lightweight but noisy and coarse. Our model then refines these motions into clean and fine-grained representations that preserve efficiency while approaching the fidelity of dense optical flow.
  • Figure 2: Architecture of ReMoRa. The model operates directly on the compressed video representation for long-video understanding. (a) It consists of an image encoder, the Refined Motion Representation (RMR) module, the Hierarchical Motion State Space (HMSS) module, and a pretrained LLM. Each clip is decomposed into groups of pictures (GOPs), each with a single I-frame and several P/B-frames represented by motion vectors. The image encoder (Enc.) extracts patch embeddings from I-frames, while the RMR module converts coarse motion vectors into dense, high-fidelity representations. (b) The HMSS module fuses the refined motions and appearance features within each GOP and models long-range dependencies across GOPs through a state space model, enabling linear-time temporal reasoning before alignment with the LLM.
  • Figure 3: Qualitative comparison between ReMoRa and LLaVA-Video on NExT-QA. In both examples, ReMoRa correctly answers questions about fine-grained, temporally contextualized human actions and object motions, while LLaVA-Video fails, highlighting ReMoRa's superior use of motion cues for fine-grained action understanding.
  • Figure 4: Further qualitative comparison between ReMoRa and LLaVA-Video on LongVideoBench. In both examples, ReMoRa correctly answers questions that require integrating spatial details with long-range temporal understanding, such as tracking how the scene and objects change over time and consistently identifying the person involved in the activity, while the baseline model fails.
  • Figure 5: Example of scene-aware video preprocessing. Frames 0 and 18 are scene-adaptive I-frames used as keyframes, and the remaining frames are P/B-frames with overlaid codec motion vectors.
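The GOP decomposition described in the captions for Figures 2 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's preprocessing pipeline: it assumes frame types are already available as a list of `"I"`, `"P"`, and `"B"` strings in display order, and the function name `split_into_gops` and the dictionary keys are hypothetical.

```python
def split_into_gops(frame_types):
    """Group frame indices into GOPs, each opened by an I-frame.

    frame_types: list of "I", "P", or "B" strings, assumed to be in display order.
    Returns a list of dicts with the keyframe index and the indices of the
    P/B-frames that follow it.
    """
    gops = []
    for idx, ftype in enumerate(frame_types):
        if ftype == "I":
            # An I-frame starts a new GOP and serves as its appearance keyframe.
            gops.append({"keyframe": idx, "motion_frames": []})
        elif gops:
            # P/B-frames contribute only motion to the current GOP.
            gops[-1]["motion_frames"].append(idx)
        # P/B-frames before the first I-frame are skipped in this sketch.
    return gops


if __name__ == "__main__":
    for gop in split_into_gops(["I", "P", "B", "P", "I", "P"]):
        print(gop)
```

Under this grouping, only one frame per GOP needs an image encoder pass, while the remaining frames are represented by their motion vectors.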