Towards Fine-Grained Human Motion Video Captioning

Guorui Song, Guocun Wang, Zhe Huang, Jing Lin, Xuefei Zhe, Jian Li, Haoqian Wang

TL;DR

This work tackles the difficulty of fine-grained human motion captioning in videos by introducing the Motion-Augmented Caption Model (M-ACM) with Motion Synergetic Decoding (MSD), a dual-pathway architecture that fuses standard visual features with SMPL-based motion representations to reduce hallucinations and improve semantic and spatial fidelity. It also introduces the Human Motion Insight (HMI) dataset (115K videos, 1,031K QA pairs) and HMI-Bench, a motion-centric benchmark for evaluating captioning and QA across detailed movement dimensions. MSD integrates five L_i components to compute a joint synergy score S(y_T) used for token selection and grounding in both modalities, leading to stronger motion descriptions and fewer errors. Experiments on two foundation models (Qwen2 7B and Llama3 8B) demonstrate that M-ACM achieves superior performance across standard caption metrics and motion-understanding dimensions, indicating significant practical value for precise human motion understanding in video captioning tasks.

Abstract

Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.

Paper Structure

This paper contains 30 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of our proposed M-ACM (Motion-Augmented Caption Model) framework. The system processes input videos through dual pathways: a standard visual pathway (top) and a motion-specialized pathway (bottom). The visual pathway extracts general visual features via a frozen vision encoder, while the motion pathway uses ViTPose-based (Xu et al., 2022) frame sampling and human mesh recovery to generate precise motion representations. Both representations are projected into a common embedding space through trainable projection alignment modules. Our key innovation, Motion Synergetic Decoding (MSD), addresses hallucination issues by comparing logit distributions from both pathways. As shown in the example, without MSD the model incorrectly describes the basketball as being handled with the "foot" (a hallucination), whereas with MSD the model correctly identifies the "hand" as the body part manipulating the ball.
  • Figure 2: Performance comparison of different synergy components in our M-ACM framework. The blue bars represent accuracy scores (left y-axis), while the orange line tracks inference time in seconds (right y-axis).
  • Figure 3: The composition of QA pairs in the HMI dataset. The outer circle represents five aspects, while the inner circle shows corresponding question types for each aspect.
  • Figure 4: The process of building the HMI dataset. Top: the video processing pipeline. Public video datasets are first collected, segmented into scenes, and cleaned. DWPose (Yang et al., 2023) and movement criteria are then used to filter the clips and retain high-quality videos. Bottom: the video annotation pipeline. Video keyframes sampled by ViTPose (Xu et al., 2022) and the original captions are used for video-text collaborative annotation with GPT-4o mini. In addition, DeepSeek-R1-Distill-Qwen-7B (Guo et al., 2025) generates question-answer pairs based on the annotated video captions.
  • Figure 5: Performance comparison of M-ACM against other models. The radar plots illustrate our model's superior performance across standard caption metrics (left) and human motion understanding dimensions (right).
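
The Figure 1 caption describes MSD as comparing logit distributions from the visual and motion pathways when choosing a token. The Python sketch below illustrates that general idea only. It is not the paper's MSD formulation: the five L_i components and the synergy score S(y_T) are not specified in this summary, so the sketch substitutes a plain weighted sum of per-pathway log-probabilities. The function names, the `alpha` weight, the three-word vocabulary, and the logit values are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def select_token(visual_logits, motion_logits, alpha=0.5):
    """Pick the next-token index from two pathways' logits.

    Assumption: the joint score is a weighted sum of log-probabilities,
    (1 - alpha) * visual + alpha * motion. This stands in for the paper's
    synergy score, whose exact form is not given in this summary.
    """
    if len(visual_logits) != len(motion_logits):
        raise ValueError("both pathways must score the same vocabulary")
    lv = log_softmax(visual_logits)
    lm = log_softmax(motion_logits)
    scores = [(1 - alpha) * v + alpha * m for v, m in zip(lv, lm)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical toy vocabulary and logits, loosely mirroring the Figure 1 example:
# the visual pathway alone slightly prefers "foot", the motion pathway prefers "hand".
vocab = ["hand", "foot", "head"]
visual_logits = [2.0, 2.5, 0.1]
motion_logits = [3.0, 0.5, 0.2]

visual_only, _ = select_token(visual_logits, motion_logits, alpha=0.0)
combined, _ = select_token(visual_logits, motion_logits, alpha=0.5)
print(vocab[visual_only], vocab[combined])
```

With `alpha=0.0` the choice depends on the visual logits alone. Raising `alpha` lets the motion pathway overrule a visually plausible but incorrect body part, which is the behavior the figure attributes to MSD.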