KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding

Boda Lin, Yongjie Zhu, Xiaocheng Gong, Wenyu Qin, Meng Wang

TL;DR

KPM-Bench tackles the dual challenges of lacking fine-grained motion descriptions and prevalent hallucinations in motion-centric video captioning. It introduces an automated annotation pipeline that fuses kinematic motion computation with structured linguistic parsing to produce PaMoR-driven, dense captions and builds a large-scale benchmark (KPM-Bench) comprising KPM-Cap, KPM-QA, and KPM-HA. To curb motion hallucinations, the authors propose Motion Parsing and Extraction (MoPE) and integrate it into a GRPO-based post-training framework, accompanied by a precise hallucination metric. Experimental results show improved content quality and task accuracy on motion-centric benchmarks, with MoPE effectively reducing hallucinations while maintaining overall linguistic quality, highlighting the dataset and methods' potential for reliable fine-grained video understanding.

Abstract

Despite recent advancements, video captioning models still face significant limitations in accurately describing fine-grained motion details and suffer from severe hallucination issues. These challenges become particularly prominent when generating captions for motion-centric videos, where precise depiction of intricate movements and limb dynamics is crucial yet often neglected. To alleviate this gap, we introduce an automated annotation pipeline that integrates kinematic-based motion computation with linguistic parsing, enabling detailed decomposition and description of complex human motions. Based on this pipeline, we construct and release the Kinematic Parsing Motion Benchmark (KPM-Bench), a novel open-source dataset designed to facilitate fine-grained motion understanding. KPM-Bench consists of (i) fine-grained video-caption pairs that comprehensively illustrate limb-level dynamics in complex actions, (ii) diverse and challenging question-answer pairs focusing specifically on motion understanding, and (iii) a meticulously curated evaluation set specifically designed to assess hallucination phenomena associated with motion descriptions. Furthermore, to address hallucination issues systematically, we propose the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm, capable of accurately extracting motion-specific attributes directly from textual captions. Leveraging MoPE, we introduce a precise hallucination evaluation metric that functions independently of large-scale vision-language or language-only models. By integrating MoPE into the GRPO post-training framework, we effectively mitigate hallucination problems, significantly improving the reliability of motion-centric video captioning models.
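The abstract does not say which kinematic quantities the pipeline computes. As a hedged illustration only, the sketch below derives two common ones from 3D keypoints: the angle at a joint and the per-frame speed of a joint. The function names `joint_angle` and `joint_speeds`, the keypoint layout, and the choice of attributes are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def joint_angle(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Angle in degrees at joint b, between segments b->a and b->c.

    Hypothetical helper: e.g. a=shoulder, b=elbow, c=wrist gives elbow flexion.
    """
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("degenerate limb segment: two keypoints coincide")
    cos_theta = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def joint_speeds(track: Sequence[Sequence[float]], fps: float) -> List[float]:
    """Finite-difference speed of one joint between consecutive frames.

    `track` holds one 3D position per frame; the result has len(track) - 1
    entries, in position units per second.
    """
    return [math.dist(p, q) * fps for p, q in zip(track, track[1:])]
```

Numerical attributes of this kind could then be passed, together with the video frames, to the captioning step that the pipeline describes.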

Paper Structure (54 sections, 10 equations, 19 figures, 7 tables, 1 algorithm)

Figures (19)

  • Figure 1: Details of KPM-Bench. In contrast to previous approaches that rely solely on manual annotation of fine-grained motion-centric captions or directly generate annotations using GPT, our method enhances the automatic annotation process by first performing pose estimation on the video and then computing relevant physical attributes. This integration of motion analysis enables the generation of motion captions that can effectively unfold the detailed progression of complex actions.
  • Figure 2: Statistical details of KPM-Bench. KPM-Bench contains three subsets: KPM-Cap, KPM-QA, and KPM-HA. In addition to visualizing example cases and the category distribution of each subset, we also show the annotation density of KPM-Cap, the option distribution of KPM-QA, and the motion type statistics of KPM-HA.
  • Figure 3: The pipeline of dataset construction. First, the video frames are processed through 3D pose estimation and kinematic calculation to obtain kinematic-related numerical attributes. Then, these kinematic attributes and the video frames are fed into GPT to obtain PaMoR-Tuple format annotations, which are then further expanded into dense captions.
  • Figure 4: Comparison cases of caption annotation performance using our method on MoVid [chen2024motionllm] and Dream-1K [yuan2025tarsier2].
  • Figure 5: Human motion (subfigure $1 \rightarrow$ subfigure $4$) can be decomposed into Position Translation (subfigure $1 \rightarrow$ subfigure $2$) and Postural Transformation (subfigure $1 \rightarrow$ subfigure $3$).
  • ...and 14 more figures
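
The decomposition named in the Figure 5 caption can be sketched in code. This is one possible reading, not the paper's definition: it assumes Position Translation is the displacement of a root joint and Postural Transformation is the change in each joint's position relative to that root. The function `decompose_motion`, the joint names, and the choice of `"pelvis"` as root are all hypothetical.

```python
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


def decompose_motion(
    pose_start: Dict[str, Vec3],
    pose_end: Dict[str, Vec3],
    root: str = "pelvis",  # assumed root joint; not specified in the source
) -> Tuple[Vec3, Dict[str, Vec3]]:
    """Split the change between two poses into root translation and posture change."""
    r0, r1 = pose_start[root], pose_end[root]
    # Position translation: how far the whole body (root joint) moved.
    translation = tuple(e - s for s, e in zip(r0, r1))

    # Postural transformation: change of each joint expressed relative to the root,
    # so a pure whole-body shift yields zero posture change.
    posture_change = {}
    for name in pose_start:
        rel0 = tuple(p - r for p, r in zip(pose_start[name], r0))
        rel1 = tuple(p - r for p, r in zip(pose_end[name], r1))
        posture_change[name] = tuple(b - a for a, b in zip(rel0, rel1))
    return translation, posture_change
```

Under this reading, a person who walks forward while raising a hand yields a non-zero translation for the root and a non-zero posture change only for the hand.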