MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang

TL;DR

MotionBench addresses the under-explored problem of fine-grained motion understanding in vision-language video models by providing a dense, motion-focused benchmark and a rigorous annotation pipeline. The paper analyzes existing video VLM compression strategies, introduces Through-Encoder Fusion (TE Fusion) to enable deeper temporal fusion under a fixed decoder sequence length, and reports state-of-the-art performance on MotionBench and other benchmarks, especially under high compression. Key findings are that current VLMs struggle with motion-level questions (often scoring below 60% accuracy) and that higher frame rates and TE Fusion substantially improve motion comprehension, though room for improvement remains. Together, the benchmark and TE Fusion offer a practical path to better fine-grained motion perception in video understanding systems.

Abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.

Paper Structure (25 sections, 1 equation, 7 figures, 7 tables)

Figures (7)

  • Figure 1: State-of-the-art video understanding models struggle with basic motion-level perception. Compared to existing benchmarks, our proposed MotionBench focuses on assessing the model's motion-level perception capability, which is critical for understanding videos with fast, instantaneous interactions and motions.
  • Figure 2: We propose MotionBench, a collection of manually curated multiple-choice queries with video clips featuring dynamic changes from various scenes such as daily life and medical instructions. We devise six primary tasks to evaluate the capability of motion-level perception. Unlike previous story-level and event-level benchmarks, MotionBench is characterized by a significantly higher annotation density, allowing for the assessment of fine-grained motions.
  • Figure 3: Basic statistics of MotionBench.
  • Figure 4: Example of dynamic information annotation.
  • Figure 5: Summary of prevalent paradigms for video compression and our proposed Through-Encoder Fusion (TE Fusion). Here we illustrate only the part before the VLM decoder, where temporal compression is performed.
  • ...and 2 more figures