FLAME: Free-form Language-based Motion Synthesis & Editing

Jihoon Kim, Jiseob Kim, Sungjoon Choi

TL;DR

FLAME tackles text-to-motion generation and editing with a diffusion-based framework that conditions motion on free-form language via a transformer decoder and a RoBERTa text encoder. It introduces time-step and motion-length tokens to handle temporal structure and variable-length motions, and employs classifier-free guidance for high semantic alignment during synthesis while enabling editing without fine-tuning. The model achieves state-of-the-art results on HumanML3D, BABEL, and KIT, and demonstrates versatile editing capabilities that extend to motion prediction and in-betweening. Together, these advances enable diverse, controllable motion generation from natural language in animation, gaming, and robotics pipelines, with practical speedups from reduced diffusion steps.
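
The classifier-free guidance step mentioned above can be illustrated with a small sketch. This is the generic guidance rule (extrapolating from an unconditional prediction toward a text-conditioned one), not FLAME's released code; the function name `guided_prediction`, the `guidance_scale` parameter, and the toy arrays are assumptions made for illustration only.

```python
import numpy as np

def guided_prediction(pred_cond, pred_uncond, guidance_scale):
    """Combine conditional and unconditional denoiser outputs.

    Generic classifier-free guidance rule (hypothetical helper, not FLAME's API):
      guidance_scale = 0 -> purely unconditional prediction
      guidance_scale = 1 -> purely text-conditioned prediction
      guidance_scale > 1 -> extrapolates past the conditioned prediction
    """
    pred_cond = np.asarray(pred_cond, dtype=float)
    pred_uncond = np.asarray(pred_uncond, dtype=float)
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy stand-ins for the two denoiser outputs at one diffusion step.
cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 1.0])
guided = guided_prediction(cond, uncond, guidance_scale=2.0)
print(guided)
```

In a real sampler the two predictions would come from running the same network once with the text condition and once with it dropped; larger scales generally trade sample diversity for closer agreement with the prompt.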

Abstract

Text-based motion generation models are drawing a surge of interest for their potential to automate the motion-making process in the game, animation, and robotics industries. In this paper, we propose FLAME, a diffusion-based motion synthesis and editing model. Inspired by the recent successes of diffusion models, we integrate diffusion-based generative models into the motion domain. FLAME can generate high-fidelity motions that are well aligned with the given text. It can also edit parts of a motion, both frame-wise and joint-wise, without any fine-tuning. FLAME involves a new transformer-based architecture that we devise to better handle motion data, which we find to be crucial for managing variable-length motions and attending well to free-form text. In experiments, we show that FLAME achieves state-of-the-art generation performance on three text-motion datasets: HumanML3D, BABEL, and KIT. We also demonstrate that the editing capability of FLAME can be extended to other tasks, such as motion prediction and motion in-betweening, which have previously been covered by dedicated models.
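
Frame-wise and joint-wise editing can be pictured as masking. The abstract does not spell out FLAME's editing procedure, so the following is only a sketch of the mask-and-merge idea commonly used for diffusion inpainting, where the merge is typically applied at every denoising step. The helpers `build_edit_mask` and `merge_motion`, the `(frames, joints, features)` layout, and the toy data are all assumptions.

```python
import numpy as np

def build_edit_mask(num_frames, num_joints, frames=None, joints=None):
    """Boolean (frames, joints) mask; True marks entries the model may change.

    Hypothetical helper: passing None for an axis leaves that whole axis editable.
    """
    mask = np.ones((num_frames, num_joints), dtype=bool)
    if frames is not None:
        frame_sel = np.zeros(num_frames, dtype=bool)
        frame_sel[list(frames)] = True
        mask &= frame_sel[:, None]
    if joints is not None:
        joint_sel = np.zeros(num_joints, dtype=bool)
        joint_sel[list(joints)] = True
        mask &= joint_sel[None, :]
    return mask

def merge_motion(reference, generated, mask):
    """Keep reference values where mask is False, generated values where True."""
    return np.where(mask[..., None], generated, reference)

# Toy motion: 4 frames, 3 joints, 2 features per joint.
reference = np.zeros((4, 3, 2))
generated = np.ones((4, 3, 2))

# Joint-wise edit: only joints 1 and 2 may change, in every frame.
joint_mask = build_edit_mask(4, 3, joints=[1, 2])
edited = merge_motion(reference, generated, joint_mask)

# In-betweening as a frame-wise edit: keep the first and last frames fixed.
between_mask = build_edit_mask(4, 3, frames=range(1, 3))
inbetween = merge_motion(reference, generated, between_mask)
```

Under this view, motion prediction corresponds to a mask that fixes the leading frames and frees the rest, and in-betweening to one that fixes both ends.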

Paper Structure

This paper contains 29 sections, 9 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of text-to-motion synthesis and text-based motion editing. Motion flows from left to right.
  • Figure 2: Overview of architecture.
  • Figure 3: Qualitative results on text-to-motion synthesis task. Motion sequences flow from left to right.
  • Figure 4: Quantitative results with different numbers of sampling steps. The same model, trained with $T=1000$ steps, is used throughout.
  • Figure 5: Qualitative results on text-based motion editing. FLAME edits reference motion with given prompts. The model is allowed to edit from both shoulders to hands in this example. Motion flows from left to right.
  • ...and 1 more figure