FTMoMamba: Motion Generation with Frequency and Text State Space Models

Chengjian Li, Xiangbo Shu, Qiongjie Cui, Yazhou Yao, Jinhui Tang

TL;DR

A novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM) that achieves superior performance on the text-to-motion generation task, especially gaining the lowest FID on the HumanML3D dataset.

Abstract

Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the significance of frequency-domain information in capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). Additionally, there is a semantic discrepancy between text and motion, which leads to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based framework, FTMoMamba, equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representations, FreqSSM decomposes sequences into low-frequency and high-frequency components, which guide the generation of static poses (e.g., sits, lay) and fine-grained motions (e.g., transition, stumble), respectively. To ensure consistency between text and motion, TextSSM encodes text features at the sentence level, aligning textual semantics with sequential features. Extensive experiments show that FTMoMamba achieves superior performance on the text-to-motion generation task, notably attaining the lowest FID of 0.181 on the HumanML3D dataset (well below the 0.421 of MLD).
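
The low/high-frequency split described in the abstract can be illustrated with a small sketch. This is not the authors' FreqSSM: the abstract does not say which transform or cutoff is used, so the real FFT along the time axis, the hard cutoff, and the names `split_frequencies` and `cutoff_ratio` are all assumptions made for illustration only.

```python
import numpy as np

def split_frequencies(x, cutoff_ratio=0.25):
    """Split a (frames, channels) sequence into low- and high-frequency parts.

    Assumption: a real FFT over the time axis with a hard cutoff. The paper's
    actual decomposition may differ.
    """
    x = np.asarray(x, dtype=float)
    n_frames = x.shape[0]
    spectrum = np.fft.rfft(x, axis=0)           # one-sided spectrum per channel
    n_bins = spectrum.shape[0]
    cutoff = max(1, int(np.ceil(cutoff_ratio * n_bins)))

    low_spec = np.zeros_like(spectrum)
    low_spec[:cutoff] = spectrum[:cutoff]       # keep slow-varying content
    high_spec = spectrum - low_spec             # everything above the cutoff

    f_low = np.fft.irfft(low_spec, n=n_frames, axis=0)
    f_high = np.fft.irfft(high_spec, n=n_frames, axis=0)
    return f_low, f_high

# Toy "motion": a held pose (constant) plus a small fast wobble in one channel.
t = np.linspace(0.0, 1.0, 64, endpoint=False)
motion = (1.0 + 0.1 * np.sin(2 * np.pi * 20 * t))[:, None]
f_low, f_high = split_frequencies(motion)
```

In this toy input the constant pose ends up in `f_low` and the fast wobble in `f_high`, and the two parts sum back to the original sequence because the split is linear.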

Paper Structure

This paper contains 12 sections, 18 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Performance of FTMoMamba. (a) Qualitative results compared with the baseline (i.e., MLD [chen2023executing]). Our method shows two merits: i) it controls static poses (e.g., sit) and fine-grained motions (e.g., transition) well; ii) it ensures text-motion alignment (e.g., object). (b) Quantitative results. Our method (mainly consisting of FreqMamba and TextMamba) achieves better performance at lower computational cost.
  • Figure 2: Overview of FTMoMamba. FTMoMamba is built upon a diffusion model with FTMamba modules; it explores frequency-domain information to guide motion generation and text-semantic information to ensure text-motion consistency in the latent space. Specifically, the diffusion model compresses and decompresses the raw motion sequence, reducing the interference of redundant information in motion generation. FTMamba, the core of the denoising module, consists of FreqMamba and TextMamba. The former decomposes motion sequences into low- and high-frequency components to guide the generation of static and fine-grained motions, respectively. The latter aligns textual semantics with sequential features to ensure text-motion consistency.
  • Figure 3: Architecture of FreqSSM. It decomposes motion features into low-frequency ($\textbf{f}_{\text{low}}$) and high-frequency ($\textbf{f}_{\text{high}}$) features. Then, it reconstructs the state transition matrix $\textbf{A}_n$ using the frequency-domain features.
  • Figure 4: Architecture of TextSSM. TextSSM integrates the text features $\textbf{f}^t$ with the output matrix $\textbf{C}$. At negligible computational cost, we reconstruct the Text State Space Model to enable cross-modal fusion.
  • Figure 5: Qualitative comparison on HumanML3D test dataset.
  • ...and 1 more figures
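
To make the roles of $\textbf{A}_n$ and $\textbf{C}$ in the Figure 3 and Figure 4 captions concrete, here is a minimal sketch of a discrete state space recurrence with a per-step transition and a text-conditioned output matrix. The captions do not give the actual equations, so everything here is an assumption: the diagonal form of $\textbf{A}_n$, the additive fusion in `text_conditioned_C`, and the names `ssm_scan`, `W`, and `C_base` are hypothetical and not taken from the paper.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h_n = A_n * h_{n-1} + B * x_n and y_n = C_n . h_n over a 1-D input.

    Assumptions: A is diagonal and given per step with shape (n_steps, d_state);
    B has shape (d_state,); C is given per step with shape (n_steps, d_state).
    """
    x = np.asarray(x, dtype=float)
    A = np.asarray(A, dtype=float)
    C = np.asarray(C, dtype=float)
    n_steps, d_state = A.shape
    h = np.zeros(d_state)
    y = np.empty(n_steps)
    for n in range(n_steps):
        h = A[n] * h + B * x[n]   # state update with step-dependent transition
        y[n] = C[n] @ h           # readout through the output matrix
    return y

def text_conditioned_C(C_base, f_t, W):
    """Hypothetical fusion: shift every row of C by a linear projection of f_t."""
    return C_base + W @ f_t       # (d_state,) broadcasts over (n_steps, d_state)

# Toy usage with made-up sizes: 3 steps, 2 state dimensions, 2 text dimensions.
C = text_conditioned_C(np.zeros((3, 2)), np.array([1.0, 2.0]), np.eye(2))
y = ssm_scan(np.ones(3), np.full((3, 2), 0.5), np.ones(2), C)
```

In a frequency-guided variant, the per-step rows of `A` would be derived from low- and high-frequency features rather than fixed, which is the part of the caption this sketch leaves to the caller.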