InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

Yebin Yang, Di Wen, Lei Qi, Weitong Kong, Junwei Zheng, Ruiping Liu, Yufan Chen, Chengzhi Wu, Kailun Yang, Yuqian Fu, Danda Pani Paudel, Luc Van Gool, Kunyu Peng

Abstract

Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at https://github.com/YNG916/InterEdit.

Paper Structure (42 sections, 29 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: An illustration of (a) the Text-guided Multi-human 3D Motion Editing (TMME) task and our proposed InterEdit model, and (b) the performance of the baselines (MotionFix [athanasiou2024motionfix], MotionLab [guo2025motionlab], InterGen [liang2024intergen], TIMotion [wang2025TIMotion]) and our InterEdit.
  • Figure 2: Overview of the proposed InterEdit framework. Given a two-person motion and an editing instruction, InterEdit uses a conditional diffusion backbone with symmetric interleaved motion tokens. It introduces (i) Semantic-Aware Plan Token Alignment for high-level editing guidance via a motion-teacher embedding, and (ii) Interaction-Aware Frequency Token Alignment using DCT-based band-energy descriptors to regulate interaction dynamics.
  • Figure 3: Qualitative comparison between our InterEdit and TIMotion [wang2025TIMotion].
  • Figure 4: Qualitative comparison under custom prompts.
  • Figure 5: Dataset statistics of InterEdit3D. (a) Coverage of semantic dimensions (Spatial, Temporal, Action-change, Body-part, Whole-body), showing dominance of spatial and temporal edits. (b) Word cloud and Top-50 distribution, highlighting frequent interaction-related and spatial terms.
  • ...and 3 more figures
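
The Figure 2 caption mentions DCT-based band-energy descriptors. The sketch below illustrates one way such a descriptor could be computed; it is not the authors' implementation. The function names (`dct_ii`, `band_energy`), the choice of interaction signal (the per-frame difference between the two people's joint positions), the equal-width band split, and the default of four bands are all assumptions made for illustration.

```python
import numpy as np


def dct_ii(x: np.ndarray) -> np.ndarray:
    """Orthonormal DCT-II along the time axis of a (T, D) array."""
    T = x.shape[0]
    n = np.arange(T)
    # Row k of the basis is the cosine at frequency index k.
    basis = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / T)
    scale = np.full(T, np.sqrt(2.0 / T))
    scale[0] = np.sqrt(1.0 / T)
    return (scale[:, None] * basis) @ x


def band_energy(signal: np.ndarray, num_bands: int = 4) -> np.ndarray:
    """Pool squared DCT coefficients into `num_bands` frequency bands.

    `signal` has shape (T, D). The band layout is an assumption:
    equal-width bands ordered from low to high frequency.
    """
    coeffs = dct_ii(signal)
    energy_per_freq = (coeffs ** 2).sum(axis=1)   # shape (T,)
    bands = np.array_split(energy_per_freq, num_bands)
    return np.array([band.sum() for band in bands])


# Hypothetical two-person motion: T frames, J joints, 3D positions.
rng = np.random.default_rng(0)
T, J = 32, 22
person_a = rng.normal(size=(T, J, 3))
person_b = rng.normal(size=(T, J, 3))

# Assumed interaction signal: relative joint positions, flattened per frame.
interaction = (person_a - person_b).reshape(T, -1)
descriptor = band_energy(interaction, num_bands=4)
print(descriptor.shape)
```

Because the transform is orthonormal, the band energies sum to the total energy of the input signal, so the descriptor only redistributes energy across frequency bands. A static interaction (constant relative pose) would place all of its energy in the lowest band.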