ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura
TL;DR
ReMoRa tackles long-form video understanding by replacing dense RGB frames with compressed-domain representations, avoiding the $O(n^2)$ cost of self-attention over long sequences. It retains a small set of I-frames for appearance and encodes temporal dynamics with motion vectors, then refines them into a Refined Motion Representation (RMR) of dense, fine-grained motion cues. A Hierarchical Motion State Space (HMSS) module models temporal dependencies in linear time $O(n)$ by exploiting the codec's GOP structure, enabling scalable long-video reasoning. The method yields state-of-the-art results on long-video benchmarks such as LongVideoBench, NExT-QA, and MLVU, and is competitive on VideoMME and Perception Test while remaining compute-efficient. This work supports practical deployment of video-language models for long narratives.
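To make the gap between $O(n^2)$ and $O(n)$ concrete, the following minimal sketch counts the work each approach performs as the token count grows. The function names and the token counts are illustrative assumptions, not values from the paper, and the counts ignore constant factors such as hidden dimension.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: grows as n^2."""
    return n_tokens * n_tokens


def scan_steps(n_tokens: int) -> int:
    """State updates in a linear-time sequential scan: grows as n."""
    return n_tokens


# Hypothetical sequence lengths, chosen only to show the trend.
for n in (1_000, 10_000, 100_000):
    pairs = attention_pairs(n)
    steps = scan_steps(n)
    print(f"n={n:>7,}  attention pairs={pairs:>15,}  scan steps={steps:>7,}  ratio={pairs // steps:,}x")
```

The ratio between the two counts equals the sequence length itself, which is why the quadratic term dominates for long videos.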
Abstract
While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on long-form video understanding by MLLMs, which is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant: self-attention has quadratic complexity in sequence length. We propose ReMoRa, a video MLLM that operates directly on the compressed representation of a video. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motion, we introduce a module that denoises it and generates a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments on a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
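As a rough illustration of how features grouped by GOP can be summarized in linear time, the sketch below runs a simple recurrence first within each GOP and then across the per-GOP summaries. This is not the authors' module: the scalar features, the fixed `decay` parameter, and the function names `linear_scan` and `summarize_video` are all assumptions made for illustration only.

```python
from typing import List


def linear_scan(features: List[float], decay: float = 0.9) -> float:
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t and return the final state."""
    state = 0.0
    for x in features:
        state = decay * state + (1.0 - decay) * x
    return state


def summarize_video(gops: List[List[float]], decay: float = 0.9) -> float:
    """Two-level scan: within each GOP, then across the per-GOP summaries."""
    gop_summaries = [linear_scan(gop, decay) for gop in gops]
    return linear_scan(gop_summaries, decay)


# Toy input: three GOPs, each holding a few scalar motion features (made-up values).
toy_gops = [[0.2, 0.4, 0.1], [0.9, 0.8], [0.0, 0.3, 0.3, 0.5]]
print(summarize_video(toy_gops))
```

Each feature is visited once in the inner scan and each GOP once in the outer scan, so the total work grows linearly with the number of features.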
