FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, Tao Chen

TL;DR

FAVOR-Bench introduces a fine-grained benchmark for video motion understanding with 1,776 videos and 8,184 close-ended QA pairs across six motion-centric tasks, plus open-ended evaluation via a GPT-assisted method and a novel LLM-free framework. It reveals notable gaps in current MLLMs' ability to capture detailed temporal dynamics and ego-centric motions. To close this gap, FAVOR-Train provides 17,152 annotated videos for supervised fine-tuning, which yields consistent improvements on FAVOR-Bench and related motion benchmarks. Together, FAVOR-Bench and FAVOR-Train offer a comprehensive platform for evaluating and advancing fine-grained video motion comprehension in multimodal models.

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel, cost-efficient LLM-free caption assessment method and a GPT-assisted one, where the former enhances benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. Fine-tuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on the motion-related tasks of TVBench, MotionBench, and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/
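The close-ended evaluation described above comes down to scoring multiple-choice answers per sub-task. The following is a minimal sketch of that scoring, not the authors' evaluation code: the record layout (`task`, `answer`, `prediction`), the task labels `task_a` and `task_b`, and the function name `per_task_accuracy` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute accuracy per sub-task and overall for multiple-choice QA.

    `records` is an iterable of dicts with the (assumed) keys
    'task', 'answer', and 'prediction'; option letters are compared
    case-insensitively after stripping whitespace.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        predicted = record["prediction"].strip().upper()
        expected = record["answer"].strip().upper()
        if predicted == expected:
            correct[task] += 1
    task_scores = {task: correct[task] / total[task] for task in total}
    n_total = sum(total.values())
    overall_score = sum(correct.values()) / n_total if n_total else 0.0
    return task_scores, overall_score


# Hypothetical toy records; the task names are placeholders, not the
# benchmark's actual sub-task names.
sample_records = [
    {"task": "task_a", "answer": "A", "prediction": "A"},
    {"task": "task_a", "answer": "B", "prediction": "c"},
    {"task": "task_b", "answer": "D", "prediction": " d "},
    {"task": "task_b", "answer": "A", "prediction": "A"},
]

per_task, overall = per_task_accuracy(sample_records)
print(per_task)
print(overall)
```

Reporting per-task scores next to the overall score makes it visible when a model does well on coarse questions but poorly on fine-grained ones, which is the kind of gap the benchmark is designed to expose.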

Paper Structure

This paper contains 26 sections, 2 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Illustration of motion understanding capabilities of proprietary and open-source MLLMs. Both models correctly answer the coarse-grained summarization question (Task 1), but fail to answer the fine-grained action detail question (Task 2). For the open-ended description task (Task 3), despite being required to focus on temporal dynamics, the responses emphasize static content, and the motion descriptions are either coarse-grained or contain errors.
  • Figure 2: Data statistics of FAVOR-Bench. Left: Task type distribution across close-ended and open-ended evaluation in FAVOR-Bench. Middle: Distribution of motion sequence length per video. Right: The word cloud statistics of motion vocabularies in FAVOR-Bench.
  • Figure 3: Overview of evaluation tasks. FAVOR-Bench comprises close-ended and open-ended evaluations. The close-ended evaluation is composed of six tasks, focusing on different aspects of fine-grained motion understanding. The open-ended evaluation comprises a GPT-assisted evaluation and a novel LLM-free framework. In the GPT-assisted evaluation, model responses are directly compared with manual captions. The LLM-free framework parses structured motion elements from responses and compares them with the structured annotations.
  • Figure S1: More data statistics of FAVOR-Bench. Left: Index distribution of correct answers for the close-ended tasks. For example, "(1)" indicates that the correct option is ranked first. Middle: Video duration distribution of FAVOR-Bench. Right: Question number distribution for videos of FAVOR-Bench.
  • Figure S2: Statistics of motion words with the highest frequency in FAVOR-Bench.
  • ...and 6 more figures