Human Motion Instruction Tuning

Lei Li, Sen Jia, Jianhao Wang, Zhongyu Jiang, Feng Zhou, Ju Dai, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang

TL;DR

LLaMo addresses the challenge of enabling large language models to reason about human motion without collapsing motion data into language tokens. It introduces a Motion Estimator, Motion Feature Enhancer, and Cross Talker to align motion with text, enabling direct input of video and motion modalities. The approach preserves motion-specific details, uses language-guided viewpoint frame selection and adaptive context aggregation, and demonstrates state-of-the-art results on MoVid-Bench, BABEL-QA, Swing, and Mo-RepCount. This work highlights the potential of motion-centric multimodal AI for sports analytics, healthcare, and behavioral reasoning, and lays groundwork for efficient real-time, human-centric multimodal systems.

Abstract

This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model's ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction. Our code and models are available on the project website: https://github.com/ILGLJ/LLaMo.

Paper Structure

This paper contains 24 sections, 13 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: A comparison of MotionLLM [chen2024motionllm], MotionGPT [chen2023motiongpt], and LLaMo highlights LLaMo's motion-specific capabilities. Equipped with a Motion Enhancer and Cross Talker module to align motion and text, LLaMo supports both video and motion inputs, enabling text-aware, fine-grained motion analysis.
  • Figure 2: Overview of the LLaMo framework. It includes three main modules: (1) Multimodal Feature Extraction for encoding video and motion data; (2) Cross Talker for aligning and fusing motion and text features; and (3) Behavior Generation Module to produce text descriptions of human behavior based on integrated features.
  • Figure 3: Overview of the Cross Talker Module, which selects key frames based on text guidance and fuses them with text features for enhanced analysis.
  • Figure 4: Example outputs from LLaMo across human activities and professional sports, showcasing its reasoning capabilities and domain-specific knowledge in motion-intensive scenarios.
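
The Figure 3 caption describes selecting key frames under text guidance and fusing them with text features. The snippet below is a minimal sketch of that general idea only, not the authors' Cross Talker. It assumes frame and text features are plain vectors of equal dimension, scores frames by cosine similarity to the text vector, and fuses by softmax-weighted pooling followed by concatenation. The function names `cosine`, `select_key_frames`, and `fuse` and the parameter `k` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_key_frames(
    frames: Sequence[Sequence[float]], text: Sequence[float], k: int
) -> Tuple[List[int], List[float]]:
    """Pick the k frames most similar to the text vector (hypothetical helper).

    Returns the chosen indices in temporal order plus all per-frame scores.
    """
    scores = [cosine(frame, text) for frame in frames]
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


def fuse(
    frames: Sequence[Sequence[float]],
    text: Sequence[float],
    indices: Sequence[int],
    scores: Sequence[float],
) -> List[float]:
    """Softmax-weight the selected frames, pool them, then append the text vector.

    This pooling-plus-concatenation step is an assumption made for illustration.
    """
    peak = max(scores[i] for i in indices)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in indices]
    total = sum(weights)
    dim = len(frames[indices[0]])
    pooled = [
        sum(w * frames[i][d] for w, i in zip(weights, indices)) / total
        for d in range(dim)
    ]
    return pooled + list(text)


# Toy inputs: four 2-D frame features and one 2-D text feature.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text = [1.0, 0.0]
key_indices, frame_scores = select_key_frames(frames, text, k=2)
fused = fuse(frames, text, key_indices, frame_scores)
```

With these toy inputs, frames 0 and 2 are the two most similar to the text vector. The fused vector has four entries: two pooled motion dimensions followed by the two text dimensions. A learned model would replace the fixed cosine score with trainable projections or attention.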