Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

Qingxuan Wu, Zhiyang Dou, Chuan Guo, Yiming Huang, Qiao Feng, Bing Zhou, Jian Wang, Lingjie Liu

TL;DR

Text2Interact tackles text-to-two-person interaction generation by addressing two obstacles: data scarcity and coarse language conditioning. It introduces InterCompose, which synthesizes diverse two-person motions by composing single-person priors guided by LLM prompts, and InterActor, which generates text-faithful, spatiotemporally coherent interactions using word-level conditioning and an adaptive interaction loss. The approach yields state-of-the-art R-Precision on the InterHuman (InterGen) benchmark and competitive FID and Diversity scores, with strong generalization to out-of-distribution prompts demonstrated via user studies and ablations. The work validates a scalable, data-efficient pipeline for two-person motion generation, and the authors plan to release code and models to advance research in animation and embodied AI.
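The compose-then-filter idea behind InterCompose can be illustrated with a minimal sketch. Everything named here is hypothetical: `compose_interactions`, the `react` and `score` callables, and the threshold stand in for the reaction generator and the neural motion evaluator described in the abstract, whose actual interfaces are not given in this summary.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical motion representation: a list of frames, each a list of floats.
Motion = List[List[float]]


def compose_interactions(
    prompt: str,
    actor_motions: Sequence[Motion],
    react: Callable[[str, Motion], Motion],
    score: Callable[[str, Motion, Motion], float],
    threshold: float,
) -> List[Tuple[Motion, Motion]]:
    """Pair each single-person motion with a generated reaction and keep
    only the pairs whose score reaches the threshold."""
    kept = []
    for actor in actor_motions:
        reaction = react(prompt, actor)  # second agent, conditioned on the first
        if score(prompt, actor, reaction) >= threshold:
            kept.append((actor, reaction))
    return kept


# Toy stand-ins (assumptions for illustration only, not the paper's models).
def toy_react(prompt: str, actor: Motion) -> Motion:
    # Mirror the actor's motion.
    return [[-x for x in frame] for frame in actor]


def toy_score(prompt: str, actor: Motion, reaction: Motion) -> float:
    # Use first-frame separation as a fake quality score.
    return abs(actor[0][0] - reaction[0][0])


pairs = compose_interactions(
    "two people shake hands",
    [[[0.1]], [[1.0]]],
    toy_react,
    toy_score,
    threshold=1.0,
)
```

With these toy functions only the second candidate passes the filter; a real system would plug a learned reaction model and evaluator into the same two slots.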

Abstract

Modeling human-human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Progress is currently hindered by 1) limited two-person training data, which is inadequate to capture the diverse intricacies of two-person interactions; and 2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose Text2Interact, a framework designed to generate realistic, text-aligned human-human interactions through a scalable high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for one agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for the other agent, and uses a neural motion evaluator to filter weak or misaligned samples, expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling. Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. We will release code and models to facilitate reproducibility.
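To make the idea of emphasizing relevant inter-person joint pairs concrete, here is a minimal single-frame sketch. The weighting rule is an assumption chosen for illustration: a softmax over negative ground-truth distances, so that joint pairs that are close together dominate the loss. The paper's actual formulation is not given in this summary, and `pairwise_dist`, `adaptive_interaction_loss`, and `temperature` are hypothetical names.

```python
import numpy as np


def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Distance between every joint of person a and every joint of person b.
    Shapes: (J, 3) and (J, 3) -> (J, J)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def adaptive_interaction_loss(pred_a, pred_b, gt_a, gt_b, temperature=0.5):
    """Weighted L1 error on inter-person joint distances for one frame."""
    d_pred = pairwise_dist(pred_a, pred_b)
    d_gt = pairwise_dist(gt_a, gt_b)
    # Assumed weighting: softmax over all joint pairs of -distance / temperature,
    # so nearby (likely interacting) pairs receive most of the weight.
    logits = -d_gt / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float((weights * np.abs(d_pred - d_gt)).sum())


# Two joints per person; joint 0 of each person is nearly in contact.
gt_a = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
gt_b = np.array([[0.1, 0.0, 0.0], [3.0, 1.0, 0.0]])

shift = np.array([0.2, 0.0, 0.0])
pred_b_close = gt_b.copy()
pred_b_close[0] += shift  # error on the near-contact joint
pred_b_far = gt_b.copy()
pred_b_far[1] += shift    # same-sized error on the distant joint

loss_exact = adaptive_interaction_loss(gt_a, gt_b, gt_a, gt_b)
loss_close = adaptive_interaction_loss(gt_a, pred_b_close, gt_a, gt_b)
loss_far = adaptive_interaction_loss(gt_a, pred_b_far, gt_a, gt_b)
```

Under this weighting, the same displacement costs far more on the near-contact joint than on the distant one, which is the qualitative behavior the abstract attributes to the adaptive loss.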

Paper Structure

This paper contains 36 sections, 2 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: (a) Our generative two-person motion composition framework, InterCompose, synthesizes plausible and diverse interactions from generated textual descriptions and a single-person motion condition (yellow). (b) Our interaction generation framework InterActor generates high-quality and plausible interactions faithful to text. A deeper color indicates a later time.
  • Figure 2: Overview of the proposed frameworks. (a) InterCompose: sample interaction and single-person descriptions via an LLM, generate a single-person motion from a motion prior [guo2024momask], then compose the second agent with a reaction model conditioned on the two-person prompt and the motion prior. (b) InterActor: an $N$-block generator with word-level conditioning and motion–motion interaction. Each block cross-attends motion tokens to CLIP word tokens [radford2021learning], followed by self-attention and inter-agent cross-attention to model individual motion and interactions.
  • Figure 3: Qualitative comparisons of interaction generation results from InterActor and InterMask [javed2024intermask]. Our method produces results with better text-motion alignment and is more robust to implausible poses. A deeper color indicates a later time.
  • Figure 4: Qualitative samples of InterCompose. Prompts are synthesized by an LLM [liu2024deepseek]. The yellow motion is synthesized by the single-person motion generator, while the blue motion is generated by the reaction model conditioned on the yellow one. A deeper color indicates a later time.
  • Figure 5: User preference study results of InterActor with and without fine-tuning on synthetic data.
  • ...and 9 more figures
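
The block ordering described in the Figure 2(b) caption (cross-attention to word tokens, then self-attention, then inter-agent cross-attention) can be sketched as follows. This is a simplified illustration under stated assumptions: single-head attention with no learned projections, layer normalization, or feed-forward layers, and random arrays in place of real motion and CLIP word tokens. The names `attention` and `interactor_block` are hypothetical.

```python
import numpy as np


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: q (Tq, D), k and v (Tk, D) -> (Tq, D)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def interactor_block(motion_a, motion_b, word_tokens):
    """One simplified block, applied symmetrically to both agents."""
    # 1) Word-level conditioning: motion tokens attend to per-word text tokens.
    a = motion_a + attention(motion_a, word_tokens, word_tokens)
    b = motion_b + attention(motion_b, word_tokens, word_tokens)
    # 2) Self-attention: each agent models its own motion over time.
    a = a + attention(a, a, a)
    b = b + attention(b, b, b)
    # 3) Inter-agent cross-attention: each agent attends to the other.
    a_out = a + attention(a, b, b)
    b_out = b + attention(b, a, a)
    return a_out, b_out


rng = np.random.default_rng(0)
motion_a = rng.normal(size=(8, 16))     # 8 motion tokens, 16 dims (made up)
motion_b = rng.normal(size=(8, 16))
word_tokens = rng.normal(size=(5, 16))  # 5 word tokens (made up)

# Stack N = 2 blocks, as an N-block generator would.
out_a, out_b = motion_a, motion_b
for _ in range(2):
    out_a, out_b = interactor_block(out_a, out_b, word_tokens)
```

Because every word keeps its own token, each motion token can weight words such as those describing initiation or response differently, which a single sentence embedding cannot offer.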