StructBiHOI: Structured Articulation Modeling for Long-Horizon Bimanual Hand-Object Interaction Generation

Zhi Wang, Liu Liu, Ruonan Liu, Dan Guo, Meng Wang

TL;DR

This work proposes StructBiHOI, a Structured articulation modeling framework for long-horizon Bimanual HOI generation. It structurally disentangles temporal joint planning from frame-level manipulation refinement and incorporates a state-space-inspired diffusion denoiser based on Mamba.

Abstract

Recent progress in 3D hand-object interaction (HOI) generation has primarily focused on single-hand grasp synthesis, while bimanual manipulation remains significantly more challenging. Long-horizon planning instability, fine-grained joint articulation, and complex cross-hand coordination make coherent bimanual generation difficult, especially under multimodal conditions. Existing approaches often struggle to simultaneously ensure temporal consistency, physical plausibility, and semantic alignment over extended sequences. We propose StructBiHOI, a Structured articulation modeling framework for long-horizon Bimanual HOI generation. Our key insight is to structurally disentangle temporal joint planning from frame-level manipulation refinement. Specifically, a jointVAE models long-term joint evolution conditioned on object geometry and task semantics, while a maniVAE refines fine-grained hand poses at the single-frame level. To enable stable and efficient long-sequence generation, we incorporate a state-space-inspired diffusion denoiser based on Mamba, which models long-range dependencies with linear complexity. This hierarchical design facilitates coherent dual-hand coordination and articulated object interaction. Extensive experiments on bimanual manipulation and single-hand grasping benchmarks demonstrate that our method achieves superior long-horizon stability, motion realism, and computational efficiency compared to strong baselines.
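
To illustrate the "linear complexity" claim, the following is a minimal sketch of a plain diagonal linear state-space recurrence, which processes a length-T sequence in a single O(T) pass. It is not the paper's denoiser and not Mamba's selective scan (where the parameters depend on the input); the function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear state-space recurrence over a scalar sequence.

    Illustrative assumption, not the paper's model:
        h_t = a * h_{t-1} + b * x_t   (elementwise over state dimensions)
        y_t = sum(c * h_t)
    Each step costs O(len(a)), so the whole scan is linear in len(xs).
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b, and c must have the same length")
    h = [0.0] * len(a)  # hidden state starts at zero
    ys = []
    for x in xs:
        # Update every state dimension from the previous state and the input.
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        # Read out a scalar output from the current state.
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


if __name__ == "__main__":
    # A unit impulse decays geometrically through a one-dimensional state.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
```

The state `h` carries information from all earlier steps forward, which is how such models capture long-range dependencies without the quadratic pairwise cost of attention over the sequence.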

Paper Structure

This paper contains 17 sections, 13 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Given a continuous action instruction, our framework generates a long-horizon sequence of continuous actions. The method hierarchically disentangles long-term joint planning and frame-level hand articulation, producing coherent and physically plausible manipulation motions.
  • Figure 2: Overview of the proposed StructBiHOI framework. We first train a maniVAE to model frame-level hand-object interactions. A motion-sequence model then generates decoupled latent grasps and global motion. The jointVAE captures object joint articulations and provides a prior for decoding the latent grasps. Finally, the decoded grasps are combined with the global motion to produce the full long-horizon hand-object interaction sequence.
  • Figure 3: The design of our Motion-aware Sequence Model.
  • Figure 4: Qualitative visual comparison between our method and baseline methods.