Contact-aware Human Motion Generation from Textual Descriptions

Sihan Ma, Qiong Cao, Jing Zhang, Dacheng Tao

TL;DR

This work tackles the problem of generating 3D interactive human motion from textual descriptions that specify body-part contacts with static objects. It introduces the RICH-CAT dataset and the CATMO framework, which separately encodes motion and contact with VQ-VAEs and uses an intertwined GPT conditioned on an interaction-aware text encoder to produce consistent, contact-aware motions that are decoded into plausible 3D poses. Key contributions include the new dataset with vertex-level contact labels, the dual-VQ-VAE plus intertwined GPT architecture, an alignment mechanism with a pretrained text encoder, and demonstrated applicability to HOI synthesis in static scenes. The approach advances realistic, controllable text-to-motion generation with explicit contact reasoning, enabling broader applications in robotics, AR/VR, and interactive scenes.

Abstract

This paper addresses the problem of generating 3D interactive human motion from text. Given a textual description depicting the actions of different body parts in contact with static objects, we synthesize sequences of 3D body poses that are visually natural and physically plausible. Yet, this task poses a significant challenge due to the inadequate consideration of interactions by physical contacts in both motion and textual descriptions, leading to unnatural and implausible sequences. To tackle this challenge, we create a novel dataset named RICH-CAT, representing "Contact-Aware Texts" constructed from the RICH dataset. RICH-CAT comprises high-quality motion, accurate human-object contact labels, and detailed textual descriptions, encompassing over 8,500 motion-text pairs across 26 indoor/outdoor actions. Leveraging RICH-CAT, we propose a novel approach named CATMO for text-driven interactive human motion synthesis that explicitly integrates human body contacts as evidence. We employ two VQ-VAE models to encode motion and body contact sequences into distinct yet complementary latent spaces and an intertwined GPT for generating human motions and contacts in a mutually conditioned manner. Additionally, we introduce a pre-trained text encoder to learn textual embeddings that better discriminate among various contact types, allowing for more precise control over synthesized motions and contacts. Our experiments demonstrate the superior performance of our approach compared to existing text-to-motion methods, producing stable, contact-aware motion sequences. Code and data will be available for research purposes at https://xymsh.github.io/RICH-CAT/
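
The abstract says motion and body contact are encoded into two separate discrete latent spaces. As a rough illustration of the quantization step that any VQ-VAE relies on, the sketch below maps continuous latent vectors to the index of their nearest codebook entry, once for a motion codebook and once for a contact codebook. The codebook values, latent vectors, and function names (`quantize`, `dequantize`) are made up for this example and are not taken from the paper; a real model would learn both the encoders and the codebooks.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each latent vector by the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each discrete token."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebooks: one per modality, kept separate on purpose.
motion_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
contact_codebook = [[0.0], [1.0]]

# Hypothetical encoder outputs for three time steps.
motion_latents = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
contact_latents = [[0.2], [0.7], [0.9]]

motion_tokens = quantize(motion_latents, motion_codebook)
contact_tokens = quantize(contact_latents, contact_codebook)
print(motion_tokens, contact_tokens)
```

Keeping the two codebooks apart means each modality gets its own token vocabulary, which is what lets a downstream sequence model treat motion tokens and contact tokens as two distinct but related streams.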

Paper Structure (38 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: We address the problem of text-driven 3D interactive human motion generation from both data and algorithmic perspectives. RICH-CAT is a novel dataset featuring (a) high-quality motion, (b) accurate contact labels, and (c) interactive textual descriptions that specify different body parts interacting with various static objects. Using it, we introduce a novel approach named CATMO to learn the complex human motion dynamics that incorporate interaction semantics provided by contact, producing natural and plausible 3D human motions of action (d).
  • Figure 2: Text annotation pipeline. Given a sequence of paired (motion, contact) data and its corresponding 3D scene mesh from the RICH dataset, our annotation pipeline produces a templated description containing information about actions and interaction details by (a) generating annotations for individual frames, (b) aggregating across multiple frames, and finally (c) automatically generating textual descriptions.
  • Figure 3: Architecture of our approach CATMO for text-driven 3D interactive human motion synthesis. Our model consists of independent (a) Motion VQ-VAE and (b) Contact VQ-VAE to encode the motion and contact modalities into distinct latent spaces. Subsequently, we autoregressively predict a distribution of motion and contact from the text via (c) the intertwined GPT to explicitly incorporate contact into motion generation. The output from the intertwined GPT is then fed into the learned Motion VQ-VAE decoder $\mathcal{D}$ to yield a sequence of 3D poses with physically plausible interactions. Additionally, the text embedding is extracted from our pretrained text encoder $\mathcal{E}_t$, with an alignment loss ensuring the consistency between interactive text embeddings and the generated poses. $\mathcal{E}_m$ is the movement encoder pretrained with the text encoder $\mathcal{E}_t$ to calculate the alignment loss.
  • Figure 4: Architecture comparison between Intertwined GPT and ordinary GPTs. A standard GPT predicts either motion only (a) or parallel motion and contact tokens sequentially (b). In contrast, Intertwined GPT (c) predicts them in a cross-conditioned manner, allowing contact-aware motion generation.
  • Figure 5: Qualitative comparison with state-of-the-art methods on the RICH-CAT test set. We compare (e) our method with (b) MDM, (c) MLD, and (d) T2M-GPT. Part of (a) the ground-truth motion is provided for reference.
  • ...and 1 more figure
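
The captions for Figures 3 and 4 describe motion and contact tokens being predicted autoregressively in a cross-conditioned manner. The sketch below shows one way such an intertwined decoding loop could look: at each step a contact token is predicted from the text and both histories, then a motion token is predicted with that fresh contact token already visible. The ordering (contact first, then motion), the `intertwined_decode` function, and the two toy predictors are assumptions made for illustration; they stand in for learned transformer heads and are not the paper's implementation.

```python
from typing import Callable, List, Tuple

# A predictor sees the text plus both token histories and returns one token.
Predictor = Callable[[str, List[int], List[int]], int]


def intertwined_decode(
    text: str,
    predict_contact: Predictor,
    predict_motion: Predictor,
    steps: int,
) -> Tuple[List[int], List[int]]:
    """Generate motion and contact tokens so that each stream conditions the other."""
    motion: List[int] = []
    contact: List[int] = []
    for _ in range(steps):
        # Contact for this step depends on all motion and contact so far.
        contact.append(predict_contact(text, motion, contact))
        # Motion for this step additionally sees the contact token just produced.
        motion.append(predict_motion(text, motion, contact))
    return motion, contact


def toy_contact(text: str, motion: List[int], contact: List[int]) -> int:
    """Hypothetical rule: contact switches on after the first motion step of a 'sit' prompt."""
    return 1 if "sit" in text and len(motion) >= 1 else 0


def toy_motion(text: str, motion: List[int], contact: List[int]) -> int:
    """Hypothetical rule: keep advancing until contact is on, then hold the pose token."""
    previous = motion[-1] if motion else 0
    return previous if contact[-1] == 1 else previous + 1


sit_motion, sit_contact = intertwined_decode(
    "a person sits on a chair", toy_contact, toy_motion, steps=4
)
walk_motion, walk_contact = intertwined_decode(
    "a person walks forward", toy_contact, toy_motion, steps=4
)
print(sit_motion, sit_contact)
print(walk_motion, walk_contact)
```

In this toy setup the motion stream changes behavior as soon as the contact stream reports contact, which is the kind of mutual dependence that a motion-only predictor cannot express.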