HuMoCon: Concept Discovery for Human Motion Understanding

Qihang Fang, Chengcheng Tang, Bugra Tekin, Shugao Ma, Yanchao Yang

TL;DR

HuMoCon addresses the challenge of understanding human motion from both video and motion data by explicitly aligning cross-modal features and preserving high-frequency dynamics. It introduces a velocity-aware masked autoencoder with VQ-VAE codebooks to discover motion concepts, coupled with a feature alignment loss and a two-stage LLM fine-tuning pipeline consisting of modality translation and instruction tuning. The approach achieves state-of-the-art results on ActivityNet-QA and BABEL-QA, with ablations confirming the importance of velocity reconstruction and cross-modal alignment. This work advances practical human motion understanding by enabling fine-grained reasoning across modalities and providing a scalable path for LLM-based motion reasoning.

Abstract

We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, together with a velocity reconstruction mechanism that enhances high-frequency feature expression and mitigates temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding. We will open-source the associated code with our paper.
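To make the over-smoothing concern concrete, the following is a minimal sketch and not the paper's implementation. It assumes velocity can be approximated by frame-to-frame finite differences and uses a moving average as a stand-in for an over-smoothed reconstruction; the function names and the toy signal are hypothetical. It shows that a reconstruction can look accurate in position while losing most of the velocity content, which is the gap a velocity reconstruction objective is meant to expose.

```python
import numpy as np

def finite_difference_velocity(x):
    """Approximate velocity of a (T, D) sequence as frame-to-frame differences."""
    return x[1:] - x[:-1]

def relative_error(target, estimate):
    """Mean squared error normalised by the mean squared magnitude of the target."""
    return float(np.mean((target - estimate) ** 2) / np.mean(target ** 2))

def moving_average(x, k=5):
    """Smooth each channel with a length-k box filter (assumed proxy for over-smoothing)."""
    kernel = np.ones(k) / k
    cols = [np.convolve(x[:, d], kernel, mode="same") for d in range(x.shape[1])]
    return np.stack(cols, axis=1)

# Toy 1-D "motion": a slow, large movement plus a small, fast jitter.
t = np.linspace(0.0, 1.0, 120)[:, None]
target = np.sin(2 * np.pi * 1 * t) + 0.1 * np.sin(2 * np.pi * 25 * t)
smoothed = moving_average(target, k=5)

rel_pos = relative_error(target, smoothed)
rel_vel = relative_error(finite_difference_velocity(target),
                         finite_difference_velocity(smoothed))
print(f"relative position error: {rel_pos:.4f}")
print(f"relative velocity error: {rel_vel:.4f}")
```

In this toy setting the relative velocity error is far larger than the relative position error, so a loss defined on positions alone would barely penalise the missing high-frequency jitter.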

Paper Structure

This paper contains 39 sections, 12 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: We present HuMoCon for concept discovery and human motion understanding with video or motion input. To address the challenges of feature misalignment and high-frequency information loss, we propose a novel feature alignment strategy and an advanced masked autoencoder that reconstructs velocity. HuMoCon enables effective motion concept discovery and accurate question answering, significantly outperforming state-of-the-art methods qualitatively and quantitatively.
  • Figure 2: System overview of our method. (a) The encoder pre-training process learns and aligns video and motion features and enhances high-frequency details through velocity reconstruction. We use a VQ-VAE-based structure and design learning objectives that push the encoder to extract semantically meaningful and fine-grained features. (b) The fine-tuning of the large language model (LLM) for video and motion reasoning consists of two stages: Modality Translation and Multi-modality Instruction Tuning. In the Modality Translation stage, we train a translation layer for each modality to map the encoded features to the LLM space. In the Instruction Tuning stage, we fine-tune the LLM to understand human motion and videos for downstream tasks.
  • Figure 3: Overview of the velocity reconstruction components. We build similar network structures for both video and motion, and we present the video part in this figure. This module is composed of two learning objectives. 1) Discriminative informativeness (left) aims to improve the distinctiveness of encoded features by reducing representational ambiguity. 2) Actionable informativeness (right) focuses on reconstructing the velocity by leveraging gradient information from the discrimination hypernetwork. For video data, we use optical flow as the velocity representation.
  • Figure 4: Example of motion understanding. We validate the proposed model’s ability to understand motion from multiple aspects. Q1 on the left and Q1-3 on the right evaluate its comprehension of motion sequences, while Q2 and Q3 on the left push it to analyze kinematic, kinesthetic, and physical properties. These results demonstrate our algorithm's ability to recognize, describe in detail, and analyze motions.
  • Figure 5: The structure of the hyper-network.
  • ...and 5 more figures
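
The captions above mention a VQ-VAE-based structure and the alignment of video and motion features. As a rough illustration only, the sketch below assumes a nearest-neighbour codebook lookup (the standard VQ-VAE quantisation step) and a cosine-distance alignment loss between paired features. The paper's actual alignment objective is not given in this excerpt, so `cosine_alignment_loss`, the codebook, and the feature values are hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Map each (D,) feature to its nearest codebook entry (standard VQ lookup).

    features: (N, D), codebook: (K, D). Returns indices (N,) and quantised features (N, D).
    """
    sq_dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = sq_dist.argmin(axis=1)
    return idx, codebook[idx]

def cosine_alignment_loss(a, b, eps=1e-8):
    """Assumed alignment loss: mean (1 - cosine similarity) over paired rows of a and b."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - (a_n * b_n).sum(axis=1)))

# Hypothetical 2-D codebook with three "concepts" and two paired clips.
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
video_feat = np.array([[0.9, 0.1], [0.1, 0.8]])
motion_feat = np.array([[0.8, 0.2], [-0.9, 0.1]])

video_ids, video_q = quantize(video_feat, codebook)
motion_ids, motion_q = quantize(motion_feat, codebook)
print("video concept ids: ", video_ids.tolist())
print("motion concept ids:", motion_ids.tolist())
print("alignment loss:", cosine_alignment_loss(video_feat, motion_feat))
```

In this toy example the first clip pair falls on the same codebook entry while the second does not, and the alignment loss is non-zero. Minimising such a loss during pre-training would pull paired video and motion features toward the same concepts.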