FLAME: Free-form Language-based Motion Synthesis & Editing

Jihoon Kim; Jiseob Kim; Sungjoon Choi

TL;DR

FLAME tackles text-to-motion generation and editing with a diffusion-based framework that conditions motion on free-form language through a transformer decoder and a RoBERTa text encoder. It introduces time-step and motion-length tokens to handle the diffusion time step and variable-length motions, and it uses classifier-free guidance to keep generated motion semantically aligned with the text while allowing editing without fine-tuning. The model achieves state-of-the-art results on HumanML3D, BABEL, and KIT, and its editing capability extends to motion prediction and in-betweening. These properties make it suited to controllable motion generation from natural language in animation, gaming, and robotics pipelines, and sampling can be sped up by using fewer diffusion steps at inference.
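To make the classifier-free guidance step concrete, here is a minimal sketch of how a conditional and an unconditional noise prediction are commonly blended at each sampling step. The function name, the flat-list representation, and the `scale` parameter are illustrative assumptions; the summary above does not state FLAME's exact parameterization or guidance scale.

```python
from typing import List


def classifier_free_guidance(
    eps_cond: List[float],
    eps_uncond: List[float],
    scale: float,
) -> List[float]:
    """Blend text-conditioned and unconditioned noise predictions.

    Uses the common form  eps = eps_uncond + scale * (eps_cond - eps_uncond).
    `scale` is a hypothetical guidance weight: 0 ignores the text,
    1 is plain conditional sampling, and values above 1 push the
    sample further toward the text condition.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy predictions standing in for the outputs of a denoising network.
guided = classifier_free_guidance([1.0, 0.0], [0.5, 0.5], scale=2.0)
```

In a real sampler the two predictions would come from the same network, evaluated once with the text embedding and once with an empty or dropped condition.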

Abstract

Text-based motion generation models are drawing a surge of interest for their potential to automate the motion-making process in the game, animation, or robotics industries. In this paper, we propose FLAME, a diffusion-based motion synthesis and editing model. Inspired by the recent successes of diffusion models, we bring diffusion-based generative modeling into the motion domain. FLAME can generate high-fidelity motions that are well aligned with the given text. It can also edit parts of a motion, both frame-wise and joint-wise, without any fine-tuning. FLAME uses a new transformer-based architecture that we devise to better handle motion data, which proves crucial for managing variable-length motions and attending well to free-form text. In experiments, we show that FLAME achieves state-of-the-art generation performance on three text-motion datasets: HumanML3D, BABEL, and KIT. We also demonstrate that the editing capability of FLAME extends to other tasks, such as motion prediction and motion in-betweening, that have previously been covered by dedicated models.
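Frame-wise and joint-wise editing without fine-tuning can be pictured as a masking operation over a frames-by-joints grid. The sketch below assumes an inpainting-style scheme in which only the masked entries are taken from the model's output and everything else is copied from the reference motion. The abstract does not spell out the mechanism, so the helper names and the one-scalar-per-joint layout are hypothetical simplifications.

```python
from typing import List, Set

Motion = List[List[float]]  # frames x joints, one scalar per joint for brevity
Mask = List[List[bool]]     # True where the model is allowed to edit


def build_edit_mask(
    num_frames: int,
    num_joints: int,
    edit_frames: range,
    edit_joints: Set[int],
) -> Mask:
    """Mark an entry editable only if both its frame and its joint are selected."""
    return [
        [(f in edit_frames) and (j in edit_joints) for j in range(num_joints)]
        for f in range(num_frames)
    ]


def blend(generated: Motion, reference: Motion, mask: Mask) -> Motion:
    """Keep generated values where the mask is True, reference values elsewhere."""
    return [
        [g if m else r for g, r, m in zip(g_row, r_row, m_row)]
        for g_row, r_row, m_row in zip(generated, reference, mask)
    ]


# Edit joint 1 in frames 1 and 2 of a 3-frame, 2-joint toy motion.
mask = build_edit_mask(3, 2, range(1, 3), {1})
reference = [[0.0, 0.0] for _ in range(3)]
generated = [[9.0, 9.0] for _ in range(3)]
edited = blend(generated, reference, mask)
```

Under this assumption, motion prediction and in-betweening become special cases: the mask covers all joints in the future frames, or in the frames between two kept keyframes. In an actual diffusion sampler the blend would be applied at every denoising step, with the reference noised to the current step's level.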

Paper Structure (29 sections, 9 equations, 6 figures, 6 tables)

Introduction
Related Work
Diffusion Models and Text-conditional Generation
3D Human Motion Generation
Proposed Method: FLAME
Diffusion-based Modeling
Training & Loss Functions
Model Architecture for Motion Data
Transformer Decoder
Pre-trained Language Model (PLM)
Time-Step (TS) and Motion-Length (ML) Tokens
Inference for Motion Synthesis
Inference for Motion Editing
Experiments
Datasets
...and 14 more sections

Figures (6)

Figure 1: Overview of text-to-motion synthesis and text-based motion editing. Motion flows from left to right.
Figure 2: Overview of architecture.
Figure 3: Qualitative results on text-to-motion synthesis task. Motion sequences flow from left to right.
Figure 4: Quantitative results with different numbers of sampling steps. The same model, trained with $T=1000$ steps, is used throughout.
Figure 5: Qualitative results on text-based motion editing. FLAME edits reference motion with given prompts. The model is allowed to edit from both shoulders to hands in this example. Motion flows from left to right.
...and 1 more figure
