Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

Ruyang Liu, Shangkun Sun, Haoran Tang, Ge Li, Wei Gao

TL;DR

Flow4Agent introduces motion priors from optical flow to enhance long-form video understanding with two novel modules: Temporal Granularity Optimization (TGO) and Motion Token Pruning (MTP). TGO adaptively partitions videos into representative dynamic events using coarse flow and HSV-based analysis, then selects event frames via a cross-modal query with a p-value constraint. MTP refines intra-frame content by pruning low-signal tokens using fine-grained flow, camera-motion compensation, and saliency cues, retaining the most dynamic tokens. Across six long-video benchmarks, Flow4Agent achieves state-of-the-art results and demonstrates strong frame-efficiency and robustness across base models, highlighting the practical impact of motion priors for scalable, long-form video reasoning.
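As a rough illustration of the temporal side, the sketch below splits a video into contiguous events wherever a per-transition motion score jumps, then picks one frame per event. It is a minimal sketch under stated assumptions, not the authors' TGO implementation: the scalar `motion` scores, the fixed `threshold`, and the middle-frame choice are all hypothetical stand-ins, and the HSV analysis and cross-modal query step are omitted.

```python
from typing import List, Tuple


def partition_events(motion: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Split frame indices into contiguous events.

    motion[i] is a hypothetical scalar summarising the coarse flow magnitude
    between frame i and frame i + 1. A new event starts after any transition
    whose score exceeds `threshold` (an assumed, hand-picked value).
    Returns inclusive (start, end) frame-index pairs.
    """
    n_frames = len(motion) + 1
    events = []
    start = 0
    for i, score in enumerate(motion):
        if score > threshold:
            events.append((start, i))  # close the current event at frame i
            start = i + 1              # next event begins at the following frame
    events.append((start, n_frames - 1))
    return events


def representative_frames(events: List[Tuple[int, int]]) -> List[int]:
    """Pick the middle frame of each event as its representative (an assumption)."""
    return [(start + end) // 2 for start, end in events]


# Eight frames, seven transitions; two large jumps separate three events.
scores = [0.1, 0.2, 5.0, 0.1, 0.3, 4.0, 0.2]
events = partition_events(scores, threshold=1.0)
reps = representative_frames(events)
```

With these toy scores the two large jumps yield three events, so three frames would be passed downstream instead of eight.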

Abstract

Long-form video understanding remains a challenging problem because long videos carry significant redundancy in both temporal and spatial content. The challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the "key" is typically semantic-aware and heavily dependent on the CLIP model as a prior. In this paper, we propose Flow4Agent, a novel framework that pioneers the use of motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both the temporal and spatial levels through two core modules. Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies: it first leverages coarse flow priors to group similar visual contents and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
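To make the token-pruning idea concrete, the following sketch keeps only the tokens whose flow differs most from the dominant (camera-like) motion in a frame. It is a hedged illustration rather than the paper's MTP: the per-token `(dx, dy)` flow vectors, the median-based camera-motion estimate, and the `keep_ratio` parameter are assumptions introduced here, and saliency cues are left out.

```python
import math
from statistics import median
from typing import List, Tuple


def prune_tokens(flow: List[Tuple[float, float]], keep_ratio: float) -> List[int]:
    """Return the indices of tokens to keep, in their original order.

    flow[i] is a hypothetical (dx, dy) flow vector for visual token i.
    Camera motion is approximated by the per-axis median flow (an assumption),
    and tokens are ranked by the magnitude of their residual motion.
    """
    cam_dx = median(dx for dx, _ in flow)
    cam_dy = median(dy for _, dy in flow)
    # Residual motion after subtracting the estimated camera motion.
    residual = [math.hypot(dx - cam_dx, dy - cam_dy) for dx, dy in flow]
    k = max(1, round(keep_ratio * len(flow)))
    ranked = sorted(range(len(flow)), key=lambda i: residual[i], reverse=True)
    return sorted(ranked[:k])


# Six tokens: four follow the camera pan (1, 0); tokens 3 and 5 move on their own.
token_flow = [(1, 0), (1, 0), (1, 0), (4, 3), (1, 0), (1, -2)]
kept = prune_tokens(token_flow, keep_ratio=0.34)
```

In this toy frame the four tokens that merely follow the pan are dropped, and the two independently moving tokens are retained.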

Paper Structure

This paper contains 17 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison between uniform sampling, dense sampling, and our proposed Flow4Agent.
  • Figure 2: Overview of the proposed Flow4Agent. TGO and MTP strategies are highlighted in the blue and yellow regions, respectively. The dashed line indicates the frames after the first stage in DES, and $\odot$ denotes the Hadamard product operator.
  • Figure 3: Performance comparison with and without Flow4Agent as the number of frames changes. Flow4Agent provides a greater improvement with fewer frames, while also achieving higher frame efficiency.
  • Figure 4: Visualizations of how Flow4Agent reduces redundancy.