HMVLM: Human Motion-Vision-Language Model via MoE LoRA

Lei Hu, Yongjing Ye, Shihong Xia

TL;DR

This work addresses two challenges in incorporating 3D human motion into foundation language models: avoiding the erosion of existing world knowledge, and designing autoregressive-compatible pose representations. It introduces HMVLM, a Mixture-of-Experts (MoE) LoRA framework that includes a non-trainable zero expert and body-part-based tokenizers to support text-to-motion, pose estimation, and motion video understanding under instruction tuning. The approach preserves the foundation model's knowledge, achieves competitive or superior performance across motion-centric tasks, and yields interpretable task specialization via the gating weights. The combination of MoE LoRA with zero-expert preservation and spatially aware pose/motion tokenization offers a scalable path toward unified, multitask human-centric multimodal models.

Abstract

The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) strategy. The framework uses a gating network to dynamically allocate LoRA expert weights based on the input prompt, enabling simultaneous fine-tuning on multiple tasks. To mitigate catastrophic forgetting during instruction tuning, we introduce a novel zero expert that preserves the pre-trained parameters for general linguistic tasks. For pose representation, we implement body-part-specific tokenization by partitioning the human body into different joint groups, enhancing the spatial resolution of the representation. Experiments show that our method effectively alleviates knowledge forgetting during instruction tuning and achieves strong performance across diverse human motion downstream tasks.
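
To make the gating idea in the abstract concrete, the following is a minimal sketch of a single MoE LoRA layer with a zero expert, written in plain Python. It is not the authors' implementation. It assumes that each LoRA expert contributes a low-rank update B(Ax), that the zero expert contributes no update at all, and that the gate is a softmax over per-expert logits. The names `LoRAExpert`, `ZeroExpert`, and `moe_lora_forward`, and the hand-picked logits standing in for the gating network, are all hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LoRAExpert:
    """Low-rank update: delta(x) = B (A x), with A of shape rank x d_in."""

    def __init__(self, A, B):
        self.A = A
        self.B = B

    def delta(self, x):
        return matvec(self.B, matvec(self.A, x))


class ZeroExpert:
    """Assumed form of the zero expert: it has no parameters and adds nothing,
    so weight routed to it leaves the frozen pre-trained output unchanged."""

    def __init__(self, d_out):
        self.d_out = d_out

    def delta(self, x):
        return [0.0] * self.d_out


def moe_lora_forward(W0, experts, gate_logits, x):
    """Frozen base output plus the gate-weighted sum of expert updates."""
    weights = softmax(gate_logits)
    out = matvec(W0, x)
    for w, expert in zip(weights, experts):
        out = [o + w * d for o, d in zip(out, expert.delta(x))]
    return out, weights


# Toy example: a frozen 2x2 identity layer, one zero expert, one motion expert.
W0 = [[1.0, 0.0], [0.0, 1.0]]
motion_expert = LoRAExpert(A=[[1.0, 1.0]], B=[[0.5], [-0.5]])
experts = [ZeroExpert(d_out=2), motion_expert]
x = [2.0, 3.0]

# Hypothetical gate logits: a general-language prompt favors the zero expert,
# a motion prompt favors the motion expert.
general_out, general_w = moe_lora_forward(W0, experts, [50.0, 0.0], x)
motion_out, motion_w = moe_lora_forward(W0, experts, [0.0, 50.0], x)
```

When the gate routes almost all weight to the zero expert, the layer output matches the frozen layer's output, which is the sense in which a zero expert could preserve pre-trained behavior on general linguistic tasks. In this sketch the logits are fixed by hand; in the paper they come from a gating network conditioned on the input prompt.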

Paper Structure

This paper contains 27 sections, 7 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: HMVLM preserves the original knowledge and dialogue capabilities of the foundation model while supporting a wide range of human-centric downstream tasks.
  • Figure 2: Method overview: the task instructions and input prompt are processed by a gating network to produce mixture weights. Modality-specific inputs are aligned with the word embeddings via projection layers, and the final outputs are generated through the pre-trained model and the weighted combination of LoRA experts.
  • Figure 3: (a) Pose/motion tokenization scheme: learnable body-part parameters are introduced into the Transformer to facilitate feature pooling and quantization; (b) instruction tuning for diverse human-centric tasks. The discrete tokens are added to the foundation model's vocabulary, and instruction tuning then guides the model to generate task-related tokens.
  • Figure 4: Qualitative results for human pose estimation and human video understanding.
  • Figure 5: Efficiency analysis of the MoE LoRA model under different numbers of experts. (a) Training time and parameter scaling. (b) Inference latency and T2M performance.
  • ...and 6 more figures
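
The body-part tokenization described in the abstract and in Figure 3(a) can be illustrated with a small sketch. This is a simplification under stated assumptions, not the paper's tokenizer: the learned Transformer pooling is replaced by plain concatenation of joint coordinates, and quantization is a nearest-neighbor lookup in a fixed per-part codebook. The joint groups, codebook entries, and the `<part_index>` token naming are hypothetical and chosen only for illustration.

```python
def nearest_code(feature, codebook):
    """Return the index of the codebook entry closest to feature (squared L2)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(feature, codebook[i]))


def tokenize_pose(pose, body_parts, codebooks):
    """Quantize each body part separately and emit one discrete token per part.

    pose:       list of joints, each a list of 3D coordinates
    body_parts: mapping from part name to the joint indices in that part
    codebooks:  mapping from part name to that part's list of code vectors
    """
    tokens = []
    for part, joint_ids in body_parts.items():
        # Stand-in for learned pooling: concatenate the part's joint coordinates.
        feature = [coord for j in joint_ids for coord in pose[j]]
        index = nearest_code(feature, codebooks[part])
        tokens.append(f"<{part}_{index}>")
    return tokens


# Hypothetical two-joint skeleton split into two single-joint groups.
body_parts = {"torso": [0], "arm": [1]}
codebooks = {
    "torso": [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "arm": [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]],
}
pose = [[0.0, 0.9, 0.1], [1.0, 0.2, 0.0]]

tokens = tokenize_pose(pose, body_parts, codebooks)
print(tokens)  # ['<torso_1>', '<arm_0>']
```

Because each part has its own codebook, a change in one limb alters only that part's token, which is one way to read the claim that per-part tokenization improves spatial resolution. Tokens of this kind would then be added to the language model's vocabulary, as Figure 3(b) describes.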