EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning

Xuehao Gao, Yang Yang, Shaoyi Du, Yang Wu, Yebin Liu, Guo-Jun Qi

TL;DR

This work tackles text-to-HOI synthesis by decomposing HOI reasoning into action-specific motion priors and object-specific interaction priors, solved via a two-stage BodyNet that first infers a canonical action motion and then enriches it with object-aware details, and an ObjectNet that plans object 3D motions with hand-contact guidance. The diffusion-based framework uses CLIP text embeddings and object geometry to steer cross-modal generation, coupled with hand-object contact reasoning and an interaction-optimization module to enhance realism. Experiments on HIMO, FullBodyManipulation, and GRAB show superior semantic consistency, interaction realism, and few-shot robustness compared with state-of-the-art baselines, supported by extensive ablations. The approach advances practical text-to-HOI synthesis by delivering controllable, diverse, and physically plausible body-object co-motions for virtual avatars and interactive scenes.

Abstract

This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D body motions, which may encounter a performance bottleneck because of the huge cross-modality gap. In this paper, we observe that HOI samples with the same interaction intention toward different targets, e.g., "lift a chair" and "lift a cup", always encapsulate similar action-specific body motion patterns while exhibiting different object-specific interaction styles. Thus, learning effective action-specific motion priors and object-specific interaction priors is crucial for a text-to-HOI model and largely determines its performance on text-HOI semantic consistency and body-object interaction realism. In light of this, we propose a novel body pose generation strategy for the text-to-HOI task: infer an object-agnostic canonical body action first and then enrich it with object-specific interaction styles. Specifically, the first stage, canonical body action inference, focuses on learning intra-class shareable body motion priors and mapping the given text-based semantics to action-specific canonical 3D body motions. Then, in the object-specific interaction inference stage, we focus on object affordance learning and enrich object-specific interaction styles on top of the inferred action-specific body motion basis. Extensive experiments verify that our proposed text-to-HOI synthesis system significantly outperforms other SOTA methods on three large-scale datasets, with better semantic consistency and interaction realism.
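The two-stage strategy described above can be illustrated with a toy sketch. This is not the authors' implementation: the denoisers `action_eps` and `interaction_eps` are untrained placeholders, the embedding and pose dimensions are arbitrary, the noise schedule uses conventional DDPM defaults, and combining the two stages as `canonical + style` is an assumption made only for illustration. The sketch shows the data flow only: a text-conditioned diffusion produces a canonical motion, and a second diffusion conditioned jointly on text and object produces an interaction style that refines it.

```python
import numpy as np

NUM_STEPS = 50  # assumed number of diffusion steps (not taken from the paper)


def make_schedule(num_steps=NUM_STEPS, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule; the values are conventional defaults, assumed here."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)


def ddpm_sample(eps_model, cond, shape, rng, num_steps=NUM_STEPS):
    """Standard DDPM ancestral sampling with a conditional noise predictor."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = eps_model(x, t, cond)  # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def action_eps(x, t, text_emb):
    """Hypothetical placeholder for a trained action-specific denoiser."""
    return 0.1 * x + 0.01 * text_emb.mean()


def interaction_eps(x, t, joint_cond):
    """Hypothetical placeholder for a trained object-specific denoiser."""
    return 0.1 * x + 0.01 * joint_cond.mean()


def generate_body_motion(text_emb, object_emb, num_frames, pose_dim, rng):
    """Stage 1: canonical motion from text. Stage 2: style from text + object."""
    shape = (num_frames, pose_dim)
    canonical = ddpm_sample(action_eps, text_emb, shape, rng)
    joint_cond = np.concatenate([text_emb, object_emb])
    style = ddpm_sample(interaction_eps, joint_cond, shape, rng)
    # Assumption: the style is applied as an additive residual on the canonical motion.
    return canonical, canonical + style


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_emb = np.ones(8)     # stand-in for a CLIP text embedding
    object_emb = np.zeros(4)  # stand-in for an object geometry embedding
    canonical, body = generate_body_motion(text_emb, object_emb, 6, 3, rng)
    print(canonical.shape, body.shape)
```

With trained denoisers in place of the placeholders, the same structure would let the first stage be shared across all objects for a given action, which is the motivation the abstract gives for the factorization.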

Paper Structure

This paper contains 34 sections, 17 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: In the text-to-HOI task, body motion reasoning can be factorized into two sequential stages: action-specific motion inference and object-specific interaction inference. The action-specific canonical body motion inferred from given textual instruction can serve as a primitive action basis for object-specific interaction reasoning.
  • Figure 2: Architecture Overview: (a) We first encapsulate an intra-class canonical pose sequence from category-specific diverse body motion samples. Then, we characterize object-specific interaction styles based on the evolution from action-specific canonical poses to the body poses interacting with objects; (b) With factorized action-specific motion and object-specific interaction priors, 3D body poses inferred from EigenActor conform to the intended semantics and naturally interact with the object they manipulate.
  • Figure 3: BodyNet Module Overview. BodyNet factorizes the body pose reasoning task of text-to-HOI into two stages: synthesize action-specific canonical motion first and then enrich it with inferred object-specific interaction styles. With a denoising-based diffusion strategy, action-specific motion diffusion learns the conditional distribution from text-based intended semantics to its intra-class canonical 3D body motions. Object-specific interaction diffusion learns the conditional distribution from text-object joint conditions to body interaction styles.
  • Figure 4: ObjectNet Module Overview. ObjectNet contains three components: contact part inference, object motion diffusion, and hand-object interaction optimization. Contact part inference analyzes object-specific hand-contactable parts for the following object-hand interaction planning. Object motion diffusion infers 3D object movements from inferred body poses and contact parts. Interaction optimization integrates inferred 3D body-object co-movements and improves the realism of the manipulation between them.
  • Figure 5: Qualitative comparison between ours and state-of-the-art methods. We visualize body-object interaction samples synthesized from different given text-object conditions. The top two rows (blue bodies) and bottom two rows (brown bodies) visualize the HOI samples synthesized from the FullBodyManipulation and GRAB test sets, respectively.
  • ...and 8 more figures