Towards Fine-Grained Human Motion Video Captioning
Guorui Song, Guocun Wang, Zhe Huang, Jing Lin, Xuefei Zhe, Jian Li, Haoqian Wang
TL;DR
This work addresses fine-grained captioning of human motion in videos. It introduces the Motion-Augmented Caption Model (M-ACM), a dual-pathway architecture that fuses standard visual features with SMPL-based motion representations, and decodes with Motion Synergetic Decoding (MSD) to reduce hallucinations and improve semantic and spatial fidelity. MSD combines five components L_i into a joint synergy score S(y_T), which is used to select tokens that are grounded in both modalities. The work also introduces the Human Motion Insight (HMI) dataset (115K videos, 1,031K QA pairs) and HMI-Bench, a motion-centric benchmark that evaluates captioning and QA across detailed movement dimensions. In experiments with two foundation models (Qwen2 7B and Llama3 8B), M-ACM outperforms previous methods on standard caption metrics and on the motion-understanding dimensions.
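To make the token-selection idea concrete, the following is a minimal sketch of scoring candidate tokens with a joint score built from several components. This summary does not define the five L_i terms or state how they are combined, so the weighted sum, the weights, the example scores, and the function names `synergy_scores` and `select_token` are all assumptions for illustration, not the authors' implementation.

```python
from typing import Dict, Sequence


def synergy_scores(
    components: Sequence[Dict[str, float]],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Combine per-token component scores L_i into one score S per token.

    Assumption: S(y) = sum_i w_i * L_i(y). The source does not give the
    actual formula; a weighted sum is used here only as an illustration.
    """
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    tokens = list(components[0])  # keep a stable candidate order
    for comp in components:
        if set(comp) != set(tokens):
            raise ValueError("all components must score the same tokens")
    return {
        tok: sum(w * comp[tok] for w, comp in zip(weights, components))
        for tok in tokens
    }


def select_token(scores: Dict[str, float]) -> str:
    """Pick the candidate token with the highest joint score."""
    return max(scores, key=scores.get)


# Hypothetical scores for three candidate next tokens from five components
# (for example, one from the visual pathway and one from the motion pathway).
components = [
    {"raises": 0.5, "waves": 0.25, "kicks": 0.25},
    {"raises": 0.25, "waves": 0.5, "kicks": 0.25},
    {"raises": 0.0, "waves": 0.5, "kicks": 0.0},
    {"raises": 0.0, "waves": 0.0, "kicks": 0.0},
    {"raises": 0.0, "waves": 0.0, "kicks": 0.0},
]
weights = [1.0, 1.0, 1.0, 1.0, 1.0]  # equal weights, chosen arbitrarily

scores = synergy_scores(components, weights)
chosen = select_token(scores)
```

In this toy setup a token favored by only one component ("raises") loses to a token supported by several ("waves"), which is the intuition behind requiring agreement between the visual and motion pathways.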
Abstract
Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.
