Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens

Ziwei Shan, Yaoyu He, Chengfeng Zhao, Jiashen Du, Jingyan Zhang, Qixuan Zhang, Jingyi Yu, Lan Xu

TL;DR

Mojito addresses robust real-time motion capture and online behavioral analysis using sparse IMU sensors integrated with large language models. The approach encodes continuous IMU streams into jitter-reduced discrete inertial tokens via a motion-aware VQ-VAE and an IMU tokenizer, learns a shared latent space with Zipf regularization, and projects inertial tokens into the LM embedding space for end-to-end reasoning. It couples a projection module and LoRA-tuned adapters within a frozen Qwen-2-7B-Instruct LM, enabling descriptive and instructive feedback with customizable styles. Experiments demonstrate improved robustness to sensor noise and drift, high-quality textual feedback comparable to vision-language models, and a web demo plus user study confirming practical utility in fitness and rehabilitation. The work advances real-time, privacy-preserving multimodal motion understanding by bridging inertial sensing and natural language.

Abstract

Human bodily movements convey critical insights into action intentions and cognitive processes, yet existing multimodal systems primarily focus on understanding human motion via language, vision, and audio, modalities that struggle to capture the dynamic forces and torques inherent in 3D motion. Inertial measurement units (IMUs) present a promising alternative, offering lightweight, wearable, and privacy-conscious motion sensing. However, processing streaming IMU data faces challenges such as wireless transmission instability, sensor noise, and drift, which limit the utility of IMUs for long-term real-time motion capture (MoCap) and, more importantly, for online motion analysis. To address these challenges, we introduce Mojito, an intelligent motion agent that integrates inertial sensing with large language models (LLMs) for interactive motion capture and behavioral analysis.

Paper Structure

This paper contains 33 sections, 17 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of our training pipeline. We quantize continuous, jittery IMU signals into a sequence of jitter-reduced, motion-aware inertial tokens by learning an IMU tokenizer through a distribution matching strategy, and we adopt a semantically aligned, LoRA fine-tuned LLM to generate precise, professional, and stylistic text feedback for human motion analysis.
  • Figure 2: IMU Tokenizing Process. The rotation, acceleration, and angular velocity components of the IMU signal are first flattened and concatenated. The resulting sequence is then processed by an encoder comprising multiple 1D convolutional layers and subsequently passed through a quantizer to generate the jitter-reduced inertial tokens.
  • Figure 3: Data Generation Pipeline. The corresponding motion label is first extracted and expanded into a descriptive sentence using the LLM. Subsequently, a prompt is employed to generate a more refined and professional description or instructional output.
  • Figure 4: Inference Pipeline. Jittery IMU signals are first tokenized into jitter-reduced inertial tokens. These tokens are concurrently processed in two ways: (1) they are decoded by the learned motion decoder to reconstruct human motion, and (2) they are projected into the language semantic space via the pretrained projection module for motion analysis.
  • Figure 5: Results Gallery. We present input IMU signals, MoCap results, system analysis, and RGB references.
  • ...and 3 more figures
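
The tokenizing process described in the Figure 2 caption (flatten and concatenate the IMU components, encode with stacked 1D convolutions, then quantize) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the sensor count (6), the 3x3 rotation-matrix representation, the layer sizes, the ReLU activations, the random weights, and the nearest-neighbor codebook lookup are all assumptions chosen for the example, and every function name is hypothetical.

```python
import numpy as np

def flatten_imu(rot, acc, gyro):
    """Flatten and concatenate per-frame IMU components.
    rot: (T, S, 3, 3), acc: (T, S, 3), gyro: (T, S, 3) -> (T, S * 15)."""
    T = rot.shape[0]
    parts = [rot.reshape(T, -1), acc.reshape(T, -1), gyro.reshape(T, -1)]
    return np.concatenate(parts, axis=1)

def conv1d(x, w, b, stride):
    """Unpadded 1D convolution over time.
    x: (T, C_in), w: (K, C_in, C_out), b: (C_out,) -> (T_out, C_out)."""
    K = w.shape[0]
    T_out = (x.shape[0] - K) // stride + 1
    out = np.empty((T_out, w.shape[2]))
    for t in range(T_out):
        window = x[t * stride : t * stride + K]          # (K, C_in)
        out[t] = np.tensordot(window, w, axes=([0, 1], [0, 1])) + b
    return out

def quantize(z, codebook):
    """Map each latent frame to the index of its nearest codebook entry.
    z: (T', D), codebook: (N, D) -> integer tokens of shape (T',)."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def tokenize(rot, acc, gyro, layers, codebook):
    """layers: list of (weight, bias, stride); ReLU between layers (assumed)."""
    x = flatten_imu(rot, acc, gyro)
    for i, (w, b, stride) in enumerate(layers):
        x = conv1d(x, w, b, stride)
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return quantize(x, codebook)

# Hypothetical setup: 64 frames, 6 sensors, 512-entry codebook of 32-d codes.
rng = np.random.default_rng(0)
T, S, D, N = 64, 6, 32, 512
rot = rng.normal(size=(T, S, 3, 3))
acc = rng.normal(size=(T, S, 3))
gyro = rng.normal(size=(T, S, 3))
layers = [
    (rng.normal(size=(4, S * 15, 64)) * 0.05, np.zeros(64), 2),
    (rng.normal(size=(4, 64, D)) * 0.05, np.zeros(D), 2),
]
codebook = rng.normal(size=(N, D))
tokens = tokenize(rot, acc, gyro, layers, codebook)
print(tokens.shape)  # one discrete token per downsampled time step
```

With two stride-2 layers, the 64 input frames are downsampled to 14 token positions, so each inertial token summarizes a short window of raw signal. In a trained system the weights and codebook would be learned rather than random; the sketch only shows the data flow from continuous signals to discrete indices.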