ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Zihao Huang; Tianqi Liu; Zhaoxi Chen; Shaocong Xu; Saining Zhang; Lixing Xiao; Zhiguo Cao; Wei Li; Hao Zhao; Ziwei Liu

TL;DR

ArtHOI is the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. It significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity.

Abstract

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
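
The flow-based part segmentation idea from the abstract can be illustrated with a minimal sketch. It assumes a dense optical-flow field is already available and separates dynamic from static pixels by thresholding flow magnitude. The function name `segment_dynamic_pixels` and the `threshold` value are hypothetical choices for illustration, and the paper's actual segmentation procedure may differ.

```python
import numpy as np

def segment_dynamic_pixels(flow, threshold=1.0):
    """Mark pixels as dynamic where optical-flow magnitude exceeds a threshold.

    flow: (H, W, 2) array of per-pixel displacements in pixels.
    threshold: hypothetical cutoff in pixels (an assumption, not from the paper).
    Returns a boolean (H, W) mask, True for dynamic pixels.
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return magnitude > threshold

# Synthetic flow: a 4x6 frame whose right half moves 3 px horizontally,
# standing in for a door panel that swings while the cabinet body stays still.
flow = np.zeros((4, 6, 2))
flow[:, 3:, 0] = 3.0

mask = segment_dynamic_pixels(flow)
```

Here `mask` is True on the moving half and False elsewhere. A real pipeline would presumably need to handle camera motion and noisy flow, which this sketch ignores.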

Paper Structure (21 sections, 11 equations, 6 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Human-Object Interaction Synthesis
Articulated Object Reconstruction
Video Distillation for 3D Reconstruction
Methodology
Problem Formulation and Overview
Flow-based Part Segmentation
Decoupled Two-Stage Reconstruction
Experiments
Settings
Implementation Details
Interaction Quality Results
Articulated Object Dynamics Results
Rigid Object Results
...and 6 more sections

Figures (6)

Figure 1: ArtHOI recovers articulated human-object scene geometry and dynamics zero-shot from monocular video priors, without 3D supervision. Unlike prior works (e.g., TRUMANS, ZeroHSI), our method provides all four capabilities simultaneously: RGB rendering, articulated object modeling, physical constraint modeling, and zero-shot generalization.
Figure 2: ArtHOI synthesizes 3D articulated interactions by reconstructing 4D scenes from monocular video priors. Stage I reconstructs object articulation with kinematic constraints. Stage II refines human motion under the reconstructed geometry.
Figure 3: Key components for articulated interaction under monocular supervision. (a) Back projection maps masks to 3D to identify moving parts. (b) Quasi-static point pairs link dynamic/static regions for kinematic stability. (c) Contact loss projects 2D keypoints into 3D using object depth, guiding human motion without multi-view cues. Ablations appear in the ablation figure (middle: (b), right: (c)).
Figure 4: Qualitative comparison of our method with baselines. Our method synthesizes more realistic articulated human-object interactions with proper contact and natural motion coordination. Best viewed in the supplementary video.
Figure 5: Comparison of our full model with ablated variants. Best viewed in the supplementary video.
...and 1 more figure
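
The contact loss described in the Figure 3(c) caption lifts 2D keypoints into 3D using object depth. A minimal sketch of that idea follows, assuming a standard pinhole camera with known intrinsics `K` and a squared-distance penalty. The function names, the intrinsics values, and the loss form are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def back_project(uv, depth, K):
    """Lift pixel (u, v) with depth z to a 3D camera-frame point (pinhole model)."""
    u, v = uv
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def contact_loss(joint_3d, uv, depth, K):
    """Squared distance between a human joint and the back-projected contact target.

    The squared-distance form is an assumed stand-in for the paper's loss.
    """
    target = back_project(uv, depth, K)
    return float(np.sum((joint_3d - target) ** 2))

# Hypothetical intrinsics: 500 px focal length, principal point at (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# A 2D contact keypoint 100 px right of the principal point, with object depth 2.0.
target = back_project((420.0, 240.0), 2.0, K)

# A hand joint placed exactly at the target incurs zero loss.
loss = contact_loss(np.array([0.4, 0.0, 2.0]), (420.0, 240.0), 2.0, K)
```

Because the depth comes from the reconstructed object rather than a second view, a term like this can constrain the hand position from a single camera.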
