Uni-Inter: Unifying 3D Human Motion Synthesis Across Diverse Interaction Contexts

Sheng Liu, Yuanzhi Liang, Jiepeng Wang, Sidan Du, Chi Zhang, Xuelong Li

TL;DR

Uni-Inter addresses the challenge of generating coherent human motion in compound interaction scenarios by unifying humans, objects, and scenes within a single 3D representational space called the Unified Interactive Volume (UIV). Motion is modeled as joint-wise spatial distributions over the UIV, which turns generation into spatial inference and supports reasoning about physical constraints, social dynamics, and task semantics. The approach employs a diffusion-based generator conditioned on text and the UIV, together with UIV-aligned regularization and a multi-task training regime. It achieves competitive or superior results across human-object, human-human, and human-scene benchmarks and generalizes to unseen entity combinations. This unified formulation offers scalable, context-aware motion synthesis for complex, real-world environments, with potential applications in character animation, embodied AI, and interactive graphics.

Abstract

We present Uni-Inter, a unified framework for human motion generation that supports a wide range of interaction scenarios, including human-human, human-object, and human-scene, within a single, task-agnostic architecture. In contrast to existing methods that rely on task-specific designs and exhibit limited generalization, Uni-Inter introduces the Unified Interactive Volume (UIV), a volumetric representation that encodes heterogeneous interactive entities into a shared spatial field. This enables consistent relational reasoning and compound interaction modeling. Motion generation is formulated as joint-wise probabilistic prediction over the UIV, allowing the model to capture fine-grained spatial dependencies and produce coherent, context-aware behaviors. Experiments across three representative interaction tasks demonstrate that Uni-Inter achieves competitive performance and generalizes well to novel combinations of entities. These results suggest that unified modeling of compound interactions offers a promising direction for scalable motion synthesis in complex environments.
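
To make the idea of a shared spatial field concrete, the following is a minimal sketch of merging heterogeneous entities into one labeled voxel grid. It is an illustration under stated assumptions, not the authors' implementation: the function name build_uiv, the integer label ids, the grid resolution, the cubic bounds, and the rule that the higher label wins on overlap are all hypothetical choices that the abstract does not specify.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float, float]

# Hypothetical label ids; the source does not specify an encoding.
EMPTY, SCENE, OBJECT, HUMAN = 0, 1, 2, 3


def build_uiv(
    entities: Iterable[Tuple[int, Sequence[Point]]],
    grid_size: int = 8,
    bounds: Tuple[float, float] = (-1.0, 1.0),
) -> List[List[List[int]]]:
    """Merge labeled point sets into a single semantic occupancy grid."""
    lo, hi = bounds
    cell = (hi - lo) / grid_size
    grid = [
        [[EMPTY] * grid_size for _ in range(grid_size)]
        for _ in range(grid_size)
    ]
    for label, points in entities:
        for point in points:
            # Skip points that fall outside the interaction volume.
            if not all(lo <= c < hi for c in point):
                continue
            # Map each coordinate to a voxel index, clamped for safety.
            i, j, k = (min(int((c - lo) / cell), grid_size - 1) for c in point)
            # Assumed merge rule: the higher label id wins on overlap.
            grid[i][j][k] = max(grid[i][j][k], label)
    return grid


# Usage: a scene point, an object point, and a human point that
# shares a voxel with the object; one point lies outside the volume.
uiv = build_uiv(
    [
        (SCENE, [(-0.9, -0.9, -0.9)]),
        (OBJECT, [(0.1, 0.1, 0.1)]),
        (HUMAN, [(0.1, 0.1, 0.1), (5.0, 0.0, 0.0)]),
    ],
    grid_size=4,
)
```

Because every entity type lands in the same grid, adding or removing an entity changes only the input list, not the shape of the conditioning signal, which is the property the unified formulation relies on.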

Paper Structure

This paper contains 12 sections, 14 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Different paradigms for compound interaction motion generation. (a) Existing methods rely on task-specific architectures, so compound interactions involving multiple entity types must be modeled separately. (b) In contrast, Uni-Inter provides a unified motion generation framework that supports arbitrary combinations of interactive entities (humans, objects, and scenes) within a single model.
  • Figure 2: (a) Uni-Inter supports arbitrary combinations of interactive entities as input and generates corresponding interaction motions. This is enabled by the Unified Interactive Volume (UIV) representation and UIV-aligned regularization. Each interaction entity—whether human, object, or scene—is first encoded as a semantic occupancy grid in the interaction space and then merged into the UIV, which serves as the conditional input to the motion generator. The generator predicts joint-wise spatial distributions guided by the carefully designed UIV-aligned regularization, enabling coherent and context-aware motion synthesis. (b) Illustration of voxel-based representations for different interaction entities, including humans, objects, and the surrounding scene.
  • Figure 3: Qualitative comparison on the Human-Object Interaction dataset. Compared to the state-of-the-art method CHOIS [li2024controllable], Uni-Inter achieves more precise control and interaction, particularly in hand movements. The blue object represents the conditional input, while the yellow-green person shows the generated motion.
  • Figure 4: Qualitative comparison on the Human-Human Interaction dataset. Compared to ReGenNet [xu2024regennet], Uni-Inter demonstrates better spatial alignment of interaction events, resulting in more realistic and context-consistent motion generation. The blue person represents the conditional input, while the yellow-green person shows the generated motion.
  • Figure 5: Qualitative comparison on the Human-Scene Interaction dataset. Compared to the SOTA method Trumans, Uni-Inter shows superior semantic understanding. In the first example, the key instruction is “left hand,” but Trumans incorrectly uses the right hand. In the second example, the key verb is “lie down,” which Trumans fails to execute, highlighting Uni-Inter’s advantage in accurately following semantic cues.
  • ...and 1 more figure
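
The Figure 2 caption says the generator predicts joint-wise spatial distributions over the volume. One simple way to read a 3D joint position out of such a distribution is a softmax-weighted average of voxel centers (a soft-argmax). The sketch below assumes that readout purely for illustration; the source does not state how positions are decoded, and the function name, flat logit layout, and bounds are hypothetical.

```python
import math
from typing import Sequence, Tuple


def expected_joint_position(
    logits: Sequence[float],
    grid_size: int,
    bounds: Tuple[float, float] = (-1.0, 1.0),
) -> Tuple[float, float, float]:
    """Decode one joint's position from per-voxel logits.

    `logits` is a flat sequence of length grid_size**3, indexed as
    i * G * G + j * G + k (an assumed layout).
    """
    lo, hi = bounds
    cell = (hi - lo) / grid_size
    # Numerically stable softmax over all voxels.
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    position = [0.0, 0.0, 0.0]
    for index, weight in enumerate(weights):
        i, rest = divmod(index, grid_size * grid_size)
        j, k = divmod(rest, grid_size)
        prob = weight / total
        # Accumulate the probability-weighted voxel center per axis.
        for axis, n in enumerate((i, j, k)):
            position[axis] += prob * (lo + (n + 0.5) * cell)
    return (position[0], position[1], position[2])


# Usage: a distribution sharply peaked at voxel (1, 0, 1) of a 2x2x2 grid.
peaked = [0.0] * 8
peaked[1 * 4 + 0 * 2 + 1] = 50.0
joint_xyz = expected_joint_position(peaked, grid_size=2)
```

A uniform distribution decodes to the center of the volume, while a sharply peaked one decodes to the center of its dominant voxel, so the same readout covers both uncertain and confident predictions.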