Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens
Ziwei Shan, Yaoyu He, Chengfeng Zhao, Jiashen Du, Jingyan Zhang, Qixuan Zhang, Jingyi Yu, Lan Xu
TL;DR
Mojito addresses robust real-time motion capture and online behavioral analysis using sparse IMU sensors integrated with large language models. The approach encodes continuous IMU streams into jitter-reduced discrete inertial tokens via a motion-aware VQ-VAE and an IMU tokenizer, learns a shared latent space with Zipf regularization, and projects inertial tokens into the LM embedding space for end-to-end reasoning. It couples a projection module and LoRA-tuned adapters within a frozen Qwen-2-7B-Instruct LM, enabling descriptive and instructive feedback with customizable styles. Experiments demonstrate improved robustness to sensor noise and drift, high-quality textual feedback comparable to vision-language models, and a web demo plus user study confirming practical utility in fitness and rehabilitation. The work advances real-time, privacy-preserving multimodal motion understanding by bridging inertial sensing and natural language.
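The tokenize-then-project idea above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `quantize` and `project_tokens`, the toy 2-D codebook, and the random projection weights are all hypothetical, and a real system would learn the codebook and projection during training. The frozen LM and LoRA adapters are omitted. The sketch only shows how nearest-codebook lookup turns slightly perturbed continuous features into stable discrete tokens, which are then linearly mapped to an assumed LM embedding width.

```python
import numpy as np


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    # features: (T, D), codebook: (K, D) -> squared distances of shape (T, K)
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)


def project_tokens(token_ids, codebook, weight, bias):
    """Look up code vectors and linearly project them to the LM embedding width."""
    return codebook[token_ids] @ weight + bias


# Hypothetical toy codebook with four 2-D codes (a learned codebook would be larger).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Continuous features with small perturbations standing in for sensor jitter.
features = np.array([[0.05, -0.02], [0.97, 0.04], [1.02, -0.03], [0.10, 0.95]])

token_ids = quantize(features, codebook)

# Hypothetical projection into an 8-dimensional "LM embedding" space.
rng = np.random.default_rng(0)
weight = rng.normal(size=(2, 8))
bias = np.zeros(8)
lm_inputs = project_tokens(token_ids, codebook, weight, bias)

print(token_ids.tolist())
print(lm_inputs.shape)
```

In this toy setup the second and third feature vectors differ slightly but fall on the same code, so they yield identical projected embeddings. Perturbations smaller than half the distance between neighboring codes do not change the token sequence.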
Abstract
Human bodily movements convey critical insights into action intentions and cognitive processes, yet existing multimodal systems primarily focus on understanding human motion via language, vision, and audio, modalities that struggle to capture the dynamic forces and torques inherent in 3D motion. Inertial measurement units (IMUs) present a promising alternative, offering lightweight, wearable, and privacy-conscious motion sensing. However, processing streaming IMU data faces challenges such as wireless transmission instability, sensor noise, and drift, which limit the utility of IMUs for long-term real-time motion capture (MoCap) and, more importantly, online motion analysis. To address these challenges, we introduce Mojito, an intelligent motion agent that integrates inertial sensing with large language models (LLMs) for interactive motion capture and behavioral analysis.
