InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE

Lipeng Wang, Hongxing Fan, Haohua Chen, Zehuan Huang, Lu Sheng

TL;DR

InterMoE tackles the challenge of generating high-fidelity, individual-specific 3D human interactions conditioned on text by introducing a dynamic temporal-selective mixture of experts. The framework combines a Synergistic Router that fuses text semantics with motion context and a Dynamic Temporal Selection mechanism that lets experts dynamically allocate capacity to salient temporal features during diffusion denoising. Empirically, it achieves state-of-the-art results on InterHuman and InterX, improving FID and R-Precision while preserving distinctive identities, and it generalizes to single-human motion generation. This modular, diffusion-based approach advances text-driven motion synthesis for interactive applications in VR and robotics and suggests broader applicability to complex multi-agent generation tasks.

Abstract

Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine their selection capacity and focus on critical temporal features, thereby preserving individual-specific characteristics while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.
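
The routing idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the additive fusion of text and motion logits, the threshold-based selection rule, the linear experts, and every name below (`synergistic_scores`, `dynamic_temporal_selection`, `moe_layer`, `threshold`) are assumptions chosen only to show how a router might combine a text embedding with per-frame motion features and how each expert's capacity could vary with the input.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def synergistic_scores(motion, text, w_motion, w_text):
    """Fuse per-frame motion features with one text embedding (assumed: additive logits).

    motion: (T, D), text: (D_text,), w_motion: (D, E), w_text: (D_text, E).
    Returns (T, E) affinities, normalized over experts for each frame.
    """
    logits = motion @ w_motion + text @ w_text  # text term broadcasts over the T frames
    return softmax(logits, axis=-1)

def dynamic_temporal_selection(scores, threshold):
    """Assumed rule: expert e keeps every frame whose affinity exceeds `threshold`.

    The number of kept frames therefore differs per expert and per input,
    instead of being a fixed top-k.
    """
    return [np.flatnonzero(scores[:, e] > threshold) for e in range(scores.shape[1])]

def moe_layer(motion, text, w_motion, w_text, expert_weights, threshold=0.3):
    """Hypothetical MoE layer with linear experts of shape (E, D, D) and a residual path."""
    scores = synergistic_scores(motion, text, w_motion, w_text)
    selected = dynamic_temporal_selection(scores, threshold)
    out = np.zeros_like(motion)
    for e, idx in enumerate(selected):
        if idx.size == 0:
            continue  # this expert selected no frames for this input
        out[idx] += scores[idx, e:e + 1] * (motion[idx] @ expert_weights[e])
    return motion + out, selected
```

In this sketch a threshold of 0 makes every expert process every frame, while a threshold of 1 makes every expert skip all frames, so the layer reduces to the residual path. A learned, input-dependent selection rule would sit between these extremes.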

Paper Structure

This paper contains 40 sections, 11 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Comparison with conventional MoE mechanisms. Token-Choice inaccurately generates the "extends" action, and Expert-Choice has low overall kinematic quality. Our framework leverages the Synergistic Router and Dynamic Temporal Selection mechanism to generate 3D human interactions that exhibit both high semantic fidelity and robust preservation of individual-specific characteristics.
  • Figure 2: The overall framework of InterMoE. (a) Causal-Skeletal VAE to encode/decode individual motions; (b) Two Cooperative MoE Denoisers to interactively perform denoising; (c) Our proposed Synergistic Router and Dynamic Temporal-Selective Expert mechanism. The router guides multiple experts to dynamically select and process critical temporal features of the motion sequence.
  • Figure 3: Qualitative comparisons with TIMotion [wang2025TIMotion] and InterMask [javedintermask]. Arrowed lines mark the trajectories of motion, red circles indicate key actions that align with the text, and purple boxes highlight identity-confusion errors.
  • Figure 4: Qualitative results to verify key components of our InterMoE.
  • Figure 5: Qualitative comparisons among the different MoE types. Arrowed lines mark the trajectories of motion.
  • ...and 3 more figures
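
The Figure 1 and Figure 5 captions contrast Token-Choice and Expert-Choice routing. As a rough reference for readers unfamiliar with the two schemes, the sketch below shows how each is commonly defined in the MoE literature, applied to a matrix of frame-to-expert scores. The function names and the boolean-mask representation are illustrative assumptions, not code from the paper.

```python
import numpy as np

def token_choice(scores, k):
    """Each frame (token) picks its k highest-scoring experts.

    scores: (T, E). Returns a boolean (T, E) assignment mask.
    """
    top = np.argsort(-scores, axis=1)[:, :k]          # (T, k) expert indices per frame
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask

def expert_choice(scores, capacity):
    """Each expert picks its `capacity` highest-scoring frames.

    scores: (T, E). Returns a boolean (T, E) assignment mask.
    """
    top = np.argsort(-scores, axis=0)[:capacity, :]   # (capacity, E) frame indices per expert
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=0)
    return mask
```

Token-choice fixes the number of experts per frame but can leave expert loads unbalanced. Expert-choice fixes the number of frames per expert but may leave some frames unprocessed. In both, the selection size is a fixed hyperparameter, whereas the abstract describes experts that determine their selection capacity dynamically.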