Human Motion Instruction Tuning
Lei Li, Sen Jia, Jianhao Wang, Zhongyu Jiang, Feng Zhou, Ju Dai, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang
TL;DR
LLaMo enables large language models to reason about human motion without collapsing motion data into language tokens. It introduces a Motion Estimator, a Motion Feature Enhancer, and a Cross Talker to align motion with text, so that video and motion can be given to the model directly. The approach preserves motion-specific details, uses language-guided viewpoint frame selection and adaptive context aggregation, and reports state-of-the-art results on MoVid-Bench, BABEL-QA, Swing, and Mo-RepCount. The work points to applications of motion-centric multimodal AI in sports analytics, healthcare, and behavioral reasoning, and lays groundwork for efficient, real-time, human-centric multimodal systems.
Abstract
This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. Unlike conventional instruction-tuning approaches, which convert non-linguistic inputs such as video or motion sequences into language tokens, LLaMo retains motion in its native form for instruction tuning. This preserves motion-specific details that are often diminished by tokenization, improving the model's ability to interpret complex human behaviors. By processing video and motion data alongside textual inputs, LLaMo enables flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction. Our code and models are available on the project website: https://github.com/ILGLJ/LLaMo.
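The claim that tokenization diminishes motion-specific detail can be illustrated with a toy example. The sketch below is not LLaMo's implementation: the 1-D "joint angle" trajectory, the five-entry codebook, and all function names are hypothetical, chosen only to show that mapping continuous motion values onto a small discrete vocabulary introduces reconstruction error, whereas keeping the values in their native form does not.

```python
import math

def make_motion(num_frames=50):
    """Toy 1-D 'joint angle' trajectory in radians (hypothetical data)."""
    return [
        0.8 * math.sin(2 * math.pi * t / num_frames)
        + 0.05 * math.sin(2 * math.pi * 7 * t / num_frames)  # fine detail
        for t in range(num_frames)
    ]

def tokenize(motion, codebook):
    """Map each frame to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
        for x in motion
    ]

def detokenize(tokens, codebook):
    """Recover an approximate trajectory from discrete tokens."""
    return [codebook[i] for i in tokens]

def mean_abs_error(a, b):
    """Average absolute per-frame difference between two trajectories."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

motion = make_motion()
codebook = [-0.8, -0.4, 0.0, 0.4, 0.8]  # assumed tiny motion vocabulary

tokens = tokenize(motion, codebook)
quantized_error = mean_abs_error(motion, detokenize(tokens, codebook))
native_error = mean_abs_error(motion, motion)  # native form is kept as-is

print(f"error after tokenization: {quantized_error:.4f}")
print(f"error in native form:     {native_error:.4f}")
```

In this toy setting the small high-frequency component of the trajectory is largely lost once frames are snapped to codebook entries, which is the kind of detail a native-form representation is meant to keep.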
