HuMoCon: Concept Discovery for Human Motion Understanding
Qihang Fang, Chengcheng Tang, Bugra Tekin, Shugao Ma, Yanchao Yang
TL;DR
HuMoCon addresses the challenge of understanding human motion from both video and motion data by explicitly aligning cross-modal features and preserving high-frequency dynamics. It introduces a velocity-aware masked autoencoder with VQ-VAE codebooks to discover motion concepts, coupled with a feature alignment loss and a two-stage LLM fine-tuning pipeline comprising modality translation and instruction tuning. The approach achieves state-of-the-art results on ActivityNet-QA and BABEL-QA, with ablations confirming the importance of velocity reconstruction and cross-modal alignment. This work advances practical human motion understanding by enabling fine-grained reasoning across modalities and providing a scalable path for LLM-based motion reasoning.
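To make the codebook idea concrete, the following is a minimal sketch of the nearest-neighbor lookup used in standard VQ-VAE quantization, where each continuous feature is replaced by its closest codebook entry. The function name `quantize`, the toy codebook, and the feature values are illustrative assumptions; the summary does not specify HuMoCon's codebook size, feature dimension, or training procedure.

```python
from typing import List, Sequence, Tuple


def quantize(
    feature: Sequence[float],
    codebook: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index and value of the codebook entry nearest to `feature`.

    Uses squared Euclidean distance, as in standard VQ-VAE quantization.
    This is a hypothetical helper, not the authors' implementation.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))
    return index, list(codebook[index])


# Toy codebook with three 2-D entries (made-up values for illustration).
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
concept_id, concept_vector = quantize([0.9, 0.8], toy_codebook)
```

In this toy setting each codebook index plays the role of one discrete "motion concept"; a real model would learn the codebook entries jointly with the encoder.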
Abstract
We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, together with a velocity reconstruction mechanism that enhances high-frequency feature expression and mitigates temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding. We will open-source the associated code with our paper.
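The intuition behind velocity reconstruction can be illustrated with a small sketch. It assumes velocity is the first-order difference between consecutive frames and that the velocity term is added to a position reconstruction term with a weight; both the formulation and the names `velocity`, `mse`, `reconstruction_loss`, and `velocity_weight` are assumptions made for illustration, not details taken from the abstract.

```python
from typing import List, Sequence

Frame = Sequence[float]


def velocity(motion: Sequence[Frame]) -> List[List[float]]:
    """First-order finite difference between consecutive frames."""
    return [
        [curr_v - prev_v for prev_v, curr_v in zip(prev, curr)]
        for prev, curr in zip(motion, motion[1:])
    ]


def mse(x: Sequence[Frame], y: Sequence[Frame]) -> float:
    """Mean squared error over all frames and channels."""
    total, count = 0.0, 0
    for frame_x, frame_y in zip(x, y):
        for a, b in zip(frame_x, frame_y):
            total += (a - b) ** 2
            count += 1
    return total / count if count else 0.0


def reconstruction_loss(
    target: Sequence[Frame],
    recon: Sequence[Frame],
    velocity_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Position MSE plus a weighted velocity MSE (assumed formulation)."""
    position_term = mse(target, recon)
    velocity_term = mse(velocity(target), velocity(recon))
    return position_term + velocity_weight * velocity_term


# A rapidly oscillating 1-D signal and an over-smoothed reconstruction of it.
oscillating = [[0.0], [1.0], [0.0], [1.0], [0.0]]
over_smoothed = [[0.5]] * 5
position_only = mse(oscillating, over_smoothed)
with_velocity = reconstruction_loss(oscillating, over_smoothed)
```

In this toy case the flat reconstruction has a position error of 0.25, while its velocity error is 1.0 because every frame-to-frame change is missed. A position-only objective therefore tolerates temporal over-smoothing far more than one that also reconstructs velocity.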
