Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Ziyin Wang, Sirui Xu, Chuan Guo, Bing Zhou, Jiangshan Gong, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui

Abstract

Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
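The abstract's key mechanism is assigning each modality (body, hand, object) its own noise level, then denoising asynchronously so a cleaner modality leads and guides the rest. The sketch below is purely illustrative and not from the paper: it assumes a simplified variance-preserving noising rule and a hypothetical `staged_schedule` helper in which one "lead" modality runs a few steps ahead of the others.

```python
import numpy as np

def add_modality_noise(x_clean, noise_levels, rng):
    """Diffuse each modality with its own noise level t in [0, 1].

    Simplified variance-preserving rule: x_t = sqrt(1-t)*x0 + sqrt(t)*eps.
    (Illustrative only; the paper's exact schedule may differ.)
    """
    noisy = {}
    for name, x0 in x_clean.items():
        t = noise_levels[name]
        eps = rng.standard_normal(x0.shape)
        noisy[name] = np.sqrt(1.0 - t) * x0 + np.sqrt(t) * eps
    return noisy

def staged_schedule(modalities, lead, steps, lead_offset):
    """Asynchronous (staged) schedule: the lead modality denoises
    `lead_offset` steps ahead, so it is cleaner at every step and can
    guide the noisier modalities via cross-attention."""
    total = steps + lead_offset
    schedule = []
    for s in range(total):
        levels = {}
        for m in modalities:
            step = s + (lead_offset if m == lead else 0)
            levels[m] = max(0.0, 1.0 - step / (total - 1))
        schedule.append(levels)
    return schedule
```

Under this toy schedule, the lead modality's noise level is never higher than the others' at any step, which is the "cleaner components guide noisier ones" property the abstract describes; all modalities still reach zero noise by the final step.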

Paper Structure

This paper contains 20 sections, 12 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: Given a textual prompt, our LIGHT generates realistic, vivid human-object interaction (HOI) motions via a novel classifier-free guidance scheme.
  • Figure 2: Overview of LIGHT. Left: Training. We form different modalities, e.g., body, hand, and object, each diffused with its own noise level. After adding modal-wise and frame-wise rotary positional encodings, the tokens are processed by a shared Transformer decoder and an MLP head to predict clean motion. Right: Inference. We compare a uniform schedule that denoises all modalities synchronously with a staged schedule that keeps one modality cleaner than the uniform run.
  • Figure 3: Qualitative comparison with baselines. Our method yields more realistic human-object interactions, fewer contact/penetration artifacts, more accurate finger positioning, and better text-motion alignment.
  • Figure 4: Qualitative comparison between our method using body and hand merged into a single token (left) versus separating body and hand into distinct tokens (right). Unrealistic grasping artifacts produced by the single-token approach are highlighted in red dashed boxes. Our separate-token strategy yields better results.
  • Figure 5: Qualitative comparison. Left: our LIGHT without guidance. Right: our full method with guidance, which markedly enhances generation quality.
  • ...and 3 more figures