Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation

Ling-An Zeng, Guohong Huang, Gaojie Wu, Wei-Shi Zheng

TL;DR

Light-T2M tackles the deployment barriers of text-to-motion generation by combining a lightweight Local Information Modeling Module, Mamba-based global modeling with a Pseudo-bidirectional Scan, and an Adaptive Textual Information Injector within a diffusion framework. The model uses about 10% of MoMask's parameters (4.48M vs 44.85M), runs inference about 16% faster (0.152s vs 0.180s), and reaches a lower FID than MoMask on both HumanML3D (0.040 vs 0.045) and KIT-ML (0.161 vs 0.228). The reduced cost is intended to make on-device or mobile deployment of T2M more practical while keeping motion quality high, with smooth local transitions and effective textual control.

Abstract

Despite the significant role text-to-motion (T2M) generation plays across various applications, current methods involve a large number of parameters and suffer from slow inference speeds, leading to high usage costs. To address this, we aim to design a lightweight model to reduce usage costs. First, unlike existing works that focus solely on global information modeling, we recognize the importance of local information modeling in the T2M task by reconsidering the intrinsic properties of human motion, leading us to propose a lightweight Local Information Modeling Module. Second, we introduce Mamba to the T2M task, reducing the number of parameters and GPU memory demands, and we have designed a novel Pseudo-bidirectional Scan to replicate the effects of a bidirectional scan without increasing parameter count. Moreover, we propose a novel Adaptive Textual Information Injector that more effectively integrates textual information into the motion during generation. By integrating the aforementioned designs, we propose a lightweight and fast model named Light-T2M. Compared to the state-of-the-art method, MoMask, our Light-T2M model features just 10% of the parameters (4.48M vs 44.85M) and achieves a 16% faster inference time (0.152s vs 0.180s), while surpassing MoMask with an FID of 0.040 (vs. 0.045) on HumanML3D dataset and 0.161 (vs. 0.228) on KIT-ML dataset. The code is available at https://github.com/qinghuannn/light-t2m.
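
The headline percentages can be checked directly from the figures quoted in the abstract. The short sketch below does that arithmetic; the variable names are illustrative, and it assumes "16% faster" means a reduction in inference time relative to MoMask, which is the reading that matches the reported numbers.

```python
# Figures quoted in the abstract (variable names are illustrative).
light_params_m, momask_params_m = 4.48, 44.85   # trainable parameters, in millions
light_time_s, momask_time_s = 0.152, 0.180      # inference time, in seconds

# Fraction of MoMask's parameters used by Light-T2M.
param_ratio = light_params_m / momask_params_m

# Relative reduction in inference time (assumed meaning of "16% faster").
time_reduction = (momask_time_s - light_time_s) / momask_time_s

print(f"parameter ratio: {param_ratio:.1%}")
print(f"inference time reduction: {time_reduction:.1%}")
```

Both values round to the percentages stated in the abstract (about 10% and about 16%).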

Paper Structure

This paper contains 26 sections, 7 equations, 6 figures, 9 tables, 2 algorithms.

Figures (6)

  • Figure 1: Comparison of FID and the number of parameters. The closer a model is to the origin, the better. Only trainable parameters are counted.
  • Figure 2: Overview of our Light-T2M. (a) Light-T2M, consisting of $N$ basic blocks, predicts $M_*^0$, from which $M^{t-1}$ can be calculated via Eq. (2). (b) The structure of our lightweight Local Information Modeling Module. (c) The motion is downsampled to obtain segments containing local semantic information. Next, a novel Adaptive Textual Information Injector and a Mamba Block are adopted to adaptively inject semantics into each segment and to model the global information, respectively. The upsampled motion and the original motion are fused by a fusion layer. (d) The overview of the inference process.
  • Figure 3: In our pseudo-bidirectional scan, each element in the original sequence can obtain the information from elements originally on its right, achieving the effect of bidirectional scanning without increasing parameters.
  • Figure 4: Illustration of our Adaptive Textual Information Injector. $\odot$ and $\copyright$ denote dot product and concatenation, respectively.
  • Figure 5: Qualitative comparisons on the HumanML3D dataset. Areas highlighted in red indicate where the generated motion does not correspond to the given text or where there are issues such as limb distortion. Dashed lines show the character's movement path: green where the path corresponds to the given text, red where it does not.
  • ...and 1 more figures
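
The Figure 3 caption says that, in the pseudo-bidirectional scan, each element can obtain information from elements originally on its right without adding parameters. The sketch below illustrates one way a single causal scan could produce that effect: the sequence is followed by a copy of itself, and only the outputs for the second copy are kept. Both the duplication scheme and the toy recurrence `causal_scan` (a stand-in for a Mamba block) are assumptions made for illustration; the source text here does not specify the paper's actual construction.

```python
def causal_scan(xs, decay=0.5):
    """Toy causal recurrence h_t = decay * h_{t-1} + x_t.

    Hypothetical stand-in for a Mamba scan: output t depends only on inputs 0..t.
    """
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def pseudo_bidirectional_scan(xs, decay=0.5):
    """Scan [xs, xs] once with the same parameters and keep the second half.

    Assumed construction for illustration only: every kept output has already
    seen the entire first copy, including elements originally on its right.
    """
    doubled = list(xs) + list(xs)
    return causal_scan(doubled, decay)[len(xs):]


seq = [1.0, 2.0, 3.0]
changed = [1.0, 2.0, 9.0]  # only the last (rightmost) element differs

# Plain causal scan: the first output cannot see the last element.
first_plain_unchanged = causal_scan(seq)[0] == causal_scan(changed)[0]

# Pseudo-bidirectional scan: the first output reacts to the last element.
first_pseudo_changed = (
    pseudo_bidirectional_scan(seq)[0] != pseudo_bidirectional_scan(changed)[0]
)

print(first_plain_unchanged, first_pseudo_changed)
```

In this toy version the scan parameters (here just `decay`) are shared across the whole doubled sequence, so right-side context comes at the cost of a longer scan rather than a second set of weights.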