SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction

Zhengyuan Li, Kai Cheng, Anindita Ghosh, Uttaran Bhattacharya, Liangyan Gui, Aniket Bera

TL;DR

SimMotionEdit tackles text-based 3D human motion editing by introducing a Motion Diffusion Transformer that is trained jointly on editing and an auxiliary motion similarity prediction task. The condition transformer enhances the text and source-motion features, while the diffusion transformer denoises edited motions under a DDPM framework, guided by the auxiliary similarity objective and a text signal injected through AdaLN-Zero conditioning. On MotionFix, the approach achieves state-of-the-art performance in both alignment (with the text and with the source motion) and motion realism, and ablations confirm the benefit of the auxiliary task and of feature augmentation. This work advances precise, text-driven motion editing for animation pipelines by learning semantically meaningful representations that bridge language and motion.
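
The AdaLN-Zero conditioning mentioned above can be illustrated with a minimal sketch. It assumes the standard formulation used in diffusion transformers, where a condition vector is projected to a per-channel shift, scale, and gate, and the projection is zero-initialized so that each block starts out as the identity. The function names, the toy dimensions, and the stand-in sublayer are hypothetical and are not taken from the paper's implementation.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, vec):
    """Plain dense layer: weight has one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def adaln_zero_block(x, cond, weight, bias, sublayer):
    """Residual block whose normalization is modulated by the condition vector.

    The condition is projected to 3*d values: shift, scale, and gate.
    With a zero-initialized projection, gate == 0 and the block returns x.
    """
    d = len(x)
    params = linear(weight, bias, cond)
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    h = [n * (1.0 + s) + sh for n, s, sh in zip(layer_norm(x), scale, shift)]
    out = sublayer(h)  # stand-in for attention or an MLP
    return [xi + g * o for xi, g, o in zip(x, gate, out)]


# Toy example (hypothetical sizes): one 4-channel motion token, 2-channel condition.
x = [0.5, -1.0, 2.0, 0.0]
cond = [1.0, 0.3]  # e.g. a pooled text feature plus a timestep embedding
d, c = len(x), len(cond)
w0 = [[0.0] * c for _ in range(3 * d)]  # zero-initialized projection
b0 = [0.0] * (3 * d)
y_init = adaln_zero_block(x, cond, w0, b0, sublayer=lambda h: [2.0 * v for v in h])
```

With the zero-initialized projection, `y_init` equals `x`, so the conditioning signal only starts to influence the output once training moves the projection away from zero.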

Abstract

Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often leading to misalignment between motion semantics and language instructions. In this paper, we introduce a related task, motion similarity prediction, and propose a multi-task training paradigm, where we train the model jointly on motion editing and motion similarity prediction to foster the learning of semantically meaningful representations. To complement this task, we design an advanced Diffusion-Transformer-based architecture that separately handles motion similarity prediction and motion editing. Extensive experiments demonstrate the state-of-the-art performance of our approach in both editing alignment and fidelity.
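
As a rough illustration of the multi-task paradigm described in the abstract, joint training on two tasks is commonly written as a weighted sum of the two losses. The form below is an assumption for illustration only: the symbols $\mathcal{L}_{\text{edit}}$ (the diffusion denoising loss on the edited motion), $\mathcal{L}_{\text{sim}}$ (the motion similarity prediction loss), and the weight $\lambda$ are hypothetical names, and the paper's actual objective may differ.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{edit}} + \lambda \, \mathcal{L}_{\text{sim}}, \qquad \lambda \ge 0
$$

Setting $\lambda = 0$ would recover training on editing alone, which is the kind of comparison an ablation of the auxiliary task would make.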

Paper Structure

This paper contains 28 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Text-Based Motion Editing. Our method SimMotionEdit generates edited human motion sequences from text instructions and source motion sequences.
  • Figure 2: Overview of SimMotionEdit. (a) The architecture consists of two modules: the condition transformer and the diffusion transformer. The condition transformer performs the auxiliary task of motion similarity prediction and enables the source motion features and the text features to mix. The diffusion transformer takes in the enhanced text features, the embedded diffusion step $t$ as the condition, the noisy edited motion, and the enriched source motion features, and predicts the denoised edited motions. (b) The auxiliary task, motion similarity prediction, is inspired by the fact that, given the text instructions, the similarity between the source and edited motions is predictable. We use blue for the source motion, red for the edited motion, and orange for the generated motion.
  • Figure 3: Raw Motion Similarity. We translate the global positions of sampled poses of the source motion and the edited motion for a clear view.
  • Figure 4: Qualitative Results. We compare our method with TMED (Athanasiou et al., 2024). Our method outperforms TMED in terms of both fidelity and alignment with source motion and text instructions.
  • Figure A.1: Perceptual Study Layout. (Upper Part) We show an example of our study layout with one source motion, one edit instruction, and one edited motion. (Lower Part) We show the scoring instructions and scoring area for all the samples in the perceptual study.
  • ...and 2 more figures