MotionFix: Text-Driven 3D Human Motion Editing

Nikos Athanasiou, Alpár Cseke, Markos Diomataris, Michael J. Black, Gül Varol

TL;DR

This paper proposes a methodology to semi-automatically collect a dataset of triplets, each consisting of a source motion, a target motion, and an edit text, and uses it to build the new MotionFix dataset. A conditional diffusion model, TMED, that takes both the source motion and the edit text as input is then trained on this data.

Abstract

The focus of this paper is on 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion. In this paper, we address both challenges. We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text, introducing the new MotionFix dataset. Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input. We develop several baselines to evaluate our model, comparing it against models trained solely on text-motion pair datasets, and demonstrate the superior performance of our model trained on triplets. We also introduce new retrieval-based metrics for motion editing, establishing a benchmark on the evaluation set of MotionFix. Our results are promising, paving the way for further research in fine-grained motion generation. Code, models, and data are available at https://motionfix.is.tue.mpg.de/ .
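
The retrieval-based evaluation mentioned in the abstract can be illustrated with a minimal recall-at-k sketch. It assumes each motion has already been mapped to a feature vector by some motion encoder (not specified here) and that the generated motion at index i should retrieve the target at index i. The function names, the cosine similarity choice, and the toy vectors are illustrative assumptions, not the paper's evaluation code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose matching gallery item (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        sims = [cosine(query, item) for item in gallery]
        # Gallery indices sorted from most to least similar to the query.
        ranked = sorted(range(len(gallery)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy features (hypothetical): the third generated motion is closer to the
# wrong targets than to its own target.
generated = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.8]]
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

r1 = recall_at_k(generated, targets, k=1)
r3 = recall_at_k(generated, targets, k=3)
```

Using the targets as the gallery corresponds to a generated-to-target benchmark; swapping in source-motion features as the gallery would give the generated-to-source variant.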

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Dataset samples: We display source motions (red) overlaid with target motions (green) from our MotionFix dataset, together with their corresponding text annotations.
  • Figure 2: Models overview: (left) We illustrate our TMED model during training. We noise the target motion for $t$ steps, and the transformer model is trained to denoise it back by one step. The conditions -- text and source motion -- are appended to the input. The CLIP backbone is frozen, while components denoted in pink are learned during training. At test time, the iterative diffusion process is initialized from random noise instead of the noised target. (right) Our MDM-BP baseline is repurposed from a pretrained text-to-motion generation model to be used only at test time for motion editing. The model is initialized from random noise, and the body parts that GPT identifies as not to be edited are copied from the source motion.
  • Figure 3: Guidances of conditions: We illustrate the R@1 performance of TMED for generated-to-target (left) and generated-to-source (right) retrieval benchmarks for $s_L, s_{M_S} \in [1, 5]$.
  • Figure 4: TMED generations: We illustrate several generations from our model with overlaid source (red) and generated (blue) motions. We showcase a variety of test cases ranging from elaborate edits (first example in top left) to short commands (e.g., "mirror"). TMED is able to perform edits that describe both temporal (e.g., "slow down") and spatial (e.g., "raise your arms higher so it is overhead") modifications.
  • Figure 5: Failure cases: We show four failure examples from our model. For each sample, we provide the source motion (red) overlaid with either the generation (blue, left) or the ground-truth target motion (green, right). In the top row, we observe that the model may fail to generate the edited motions when the edit text is detailed and the motion differences are subtle. In the bottom row, although the generated motions follow the edit text, they diverge from the source motions.
  • ...and 2 more figures
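
The guidance scales $s_L$ and $s_{M_S}$ in the Figure 3 caption weight the text and source-motion conditions at sampling time. The exact combination rule is not given in this excerpt, so the sketch below assumes the common two-condition classifier-free guidance form (as used, for example, in InstructPix2Pix); the function name, argument names, and toy values are hypothetical.

```python
def guided_noise(eps_uncond, eps_motion, eps_full, s_motion, s_text):
    """Combine three noise predictions with two guidance scales.

    eps_uncond: prediction with no conditions
    eps_motion: prediction conditioned on the source motion only
    eps_full:   prediction conditioned on the source motion and the edit text
    s_motion:   guidance scale for the source motion (s_{M_S} in the caption)
    s_text:     guidance scale for the edit text (s_L in the caption)

    Assumed form (not confirmed by this excerpt):
        eps = eps_uncond
              + s_motion * (eps_motion - eps_uncond)
              + s_text   * (eps_full   - eps_motion)
    """
    return [
        u + s_motion * (m - u) + s_text * (f - m)
        for u, m, f in zip(eps_uncond, eps_motion, eps_full)
    ]


# Toy two-dimensional predictions (hypothetical values).
eps_uncond = [0.0, 0.0]
eps_motion = [1.0, 2.0]
eps_full = [2.0, 2.0]

# With both scales at 1 the result reduces to the fully conditioned prediction.
plain = guided_noise(eps_uncond, eps_motion, eps_full, s_motion=1.0, s_text=1.0)

# Larger scales push the prediction further along each condition's direction.
strong = guided_noise(eps_uncond, eps_motion, eps_full, s_motion=2.0, s_text=3.0)
```

Under this assumed form, raising the motion scale pulls generations toward the source motion, while raising the text scale pulls them toward the edit instruction, which is the trade-off that a sweep over both scales in $[1, 5]$ would expose.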