TextIM: Part-aware Interactive Motion Synthesis from Text

Siyuan Fan, Bo Du, Xiantao Cai, Bo Peng, Longling Sun

TL;DR

TextIM addresses the challenge of generating TEXT-driven interactive human motions with precise part-level semantics. It introduces a diffusion-based, part-aware framework comprising an interaction-aware module guided by a large language model and a spatial coherence module based on a Part-GCN, which together align interactive movements with textual intents. The method formulates motion as $\mathbf{x} \in \mathbb{R}^{T \times D}$ and employs a forward process $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ and a conditional reverse process $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{c})$, predicting the clean motion $\hat{\mathbf{x}}_0$ rather than the noise, and uses a binary mask $m$ and CLIP guidance to control interactive parts. The results, based on relabeled HUMANML3D data and reinforced with physics-based testing, show improved part-level semantic accuracy and demonstrate applicability to interactions with deformable objects in simulation.
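To make the diffusion notation above concrete, the following is a minimal NumPy sketch of a standard DDPM-style forward process and an $\hat{\mathbf{x}}_0$-prediction loss restricted by a binary mask. It is not the authors' implementation: the linear noise schedule, the helper names (`make_alpha_bar`, `q_sample`, `masked_x0_loss`), the toy dimensions, and the choice to apply the mask $m$ over feature dimensions are all assumptions made for illustration. The summary does not specify how TextIM actually applies the mask.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0), the closed form implied by chaining
    the Gaussian steps q(x_t | x_{t-1})."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def masked_x0_loss(x0_hat, x0, mask):
    """Mean squared error between predicted and true clean motion,
    counted only on feature dimensions where mask == 1 (assumed usage)."""
    sq_err = (x0_hat - x0) ** 2 * mask            # mask broadcasts over frames
    num_entries = max(mask.sum() * x0.shape[0], 1.0)
    return sq_err.sum() / num_entries

# Toy example: T frames, D features per frame (values are placeholders).
rng = np.random.default_rng(0)
T, D = 60, 12
x0 = rng.standard_normal((T, D))                  # stand-in for a clean motion
alpha_bar = make_alpha_bar(1000)
x_t = q_sample(x0, 500, alpha_bar, rng)           # noised motion at step 500

mask = np.zeros(D)
mask[:4] = 1.0                                    # hypothetical "interactive part" features
loss = masked_x0_loss(x_t, x0, mask)              # x_t used as a stand-in for a denoiser output
```

In a real model, `x0_hat` would come from a network conditioned on the noised motion, the timestep, and the text condition $\mathbf{c}$; here the noised sample is passed in only so that the loss can be evaluated without a trained denoiser.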

Abstract

In this work, we propose TextIM, a novel framework for synthesizing TEXT-driven human Interactive Motions, with a focus on the precise alignment of part-level semantics. Existing methods often overlook the critical roles of interactive body parts and fail to adequately capture and align part-level semantics, resulting in inaccuracies and even erroneous movement outcomes. To address these issues, TextIM utilizes a decoupled conditional diffusion framework to enhance the detailed alignment between interactive movements and corresponding semantic intents from textual descriptions. Our approach leverages large language models, functioning as a human brain, to identify interacting human body parts and to comprehend interaction semantics to generate complicated and subtle interactive motion. Guided by the refined movements of the interacting parts, TextIM further extends these movements into a coherent whole-body motion. We design a spatial coherence module to complement the entire body movements while maintaining consistency and harmony across body parts using a part graph convolutional network. For training and evaluation, we carefully selected and re-labeled interactive motions from HUMANML3D to develop a specialized dataset. Experimental results demonstrate that TextIM produces semantically accurate human interactive motions, significantly enhancing the realism and applicability of synthesized interactive motions in diverse scenarios, even including interactions with deformable and dynamically changing objects.

Paper Structure

This paper contains 15 sections, 5 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: TextIM generates human interactive motions with part-level semantic accuracy from textual descriptions.
  • Figure 2: TextIM overview. TextIM synthesizes human interactive motions in a decoupled manner to align part-level motions with textual semantics. Given a textual instruction, we employ an LLM to extract interaction instructions and body parts, which are used to generate the corresponding interaction motion. Subsequently, we use a GCN to learn spatial features from the interaction motion; these features, together with the interaction information, guide the generation of the final result.
  • Figure 3: Multi-layer adjacency matrices enable intra-part and inter-part feature aggregation. The black loops show the intra-part connections of each body part, and the red loop shows the inter-part connection.
  • Figure 4: Semantic accuracy. TextIM accurately generates human motions based on detailed, part-level textual descriptions, ensuring semantic precision in interactive motion generation. The examples illustrate how specifying an interactive body part for the same action can lead to different motion outcomes.
  • Figure 5: Motion combination. TextIM combines existing motions into new ones across both temporal and spatial dimensions, enhancing the data efficiency of interactive motion generation. The examples illustrate how motions can be combined both temporally and spatially to produce varied outcomes.
  • ...and 1 more figure
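
The intra-part and inter-part aggregation described in the Figure 3 caption can be illustrated with a small NumPy sketch. Everything specific here is an assumption for illustration: the 7-joint toy skeleton, the grouping of joints into parts, the names `build_adjacency`, `normalize`, and `aggregate`, and the use of symmetric normalization with self-loops (a common graph-convolution choice, not confirmed as the paper's Part-GCN design). Learned weights and nonlinearities are omitted.

```python
import numpy as np

# Hypothetical toy skeleton: 7 joints, bone list, and part grouping.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4), (1, 5), (5, 6)]
PARTS = {"torso": [0, 1, 2], "left_arm": [3, 4], "right_arm": [5, 6]}

def build_adjacency(num_joints, edges, parts):
    """Split skeleton edges into an intra-part and an inter-part adjacency matrix."""
    part_of = {j: name for name, joints in parts.items() for j in joints}
    intra = np.zeros((num_joints, num_joints))
    inter = np.zeros((num_joints, num_joints))
    for i, j in edges:
        target = intra if part_of[i] == part_of[j] else inter
        target[i, j] = target[j, i] = 1.0         # undirected edge
    return intra, inter

def normalize(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def aggregate(features, intra, inter):
    """Two stacked aggregation steps: within parts first, then across parts."""
    within_parts = normalize(intra) @ features
    return normalize(inter) @ within_parts

intra, inter = build_adjacency(7, EDGES, PARTS)
joint_features = np.ones((7, 3))                  # placeholder per-joint features
out = aggregate(joint_features, intra, inter)
```

Keeping the two adjacency matrices separate lets each layer decide whether information flows only inside a body part or also across the joints that connect parts, which is the distinction the black and red loops in the figure draw.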