BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects

Wanyue Zhang, Rishabh Dabral, Vladislav Golyanik, Vasileios Choutas, Eduardo Alvarado, Thabo Beeler, Marc Habermann, Christian Theobalt

TL;DR

BimArt tackles the challenge of generating realistic 3D bimanual hand interactions with articulated objects given object trajectories. It introduces a three-stage pipeline: (i) an articulation-aware canonical object representation based on part-based Basis Point Sets, (ii) a diffusion-based Bimanual Contact Generation Model that produces left and right contact maps, and (iii) a diffusion-based Bimanual Hand Motion Model guided by these contacts, followed by MANO-based optimization to ensure physical plausibility. The key contributions include a unified, category-agnostic object representation, a generative contact prior for articulated objects, and a contact-guided motion synthesis framework that yields high diversity and plausibility, validated on ARCTIC and HOI4D datasets. The approach achieves state-of-the-art performance in interaction plausibility and diversity, enabling artists and researchers to synthesize controllable, realistic hand-object animations for articulated objects. While demonstrated on a fixed set of object categories, the framework points toward zero-shot generalization and faster sampling as promising future directions.

Abstract

We present BimArt, a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. Unlike prior works, we do not rely on a reference grasp, a coarse hand trajectory, or separate modes for grasping and articulating. To achieve this, we first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation, revealing rich bimanual patterns for manipulation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation. Our work offers key insights into feature representation and contact prior for articulated objects, demonstrating their effectiveness in taming the complex, high-dimensional space of bimanual hand-object interactions. Through comprehensive quantitative experiments, we demonstrate a clear step towards simplified and high-quality hand-object animations that surpass the state of the art in motion quality and diversity. Project page: https://vcai.mpi-inf.mpg.de/projects/bimart/.
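As an illustration of what a distance-based contact map can look like, the following is a minimal sketch and not the authors' implementation. It assumes the object is given as sampled surface points and each hand as a set of surface keypoints, and it maps the distance to the nearest hand keypoint to a value in (0, 1] with an exponential falloff. The function names, the falloff form, and the length `sigma` are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np


def contact_map(object_points, hand_points, sigma=0.02):
    """Distance-based contact value in (0, 1] for every object point.

    object_points: (P, 3) points sampled on the object surface.
    hand_points:   (J, 3) keypoints sampled on one hand.
    sigma:         hypothetical falloff length in meters (an assumption,
                   not a value taken from the paper).
    """
    # Pairwise offsets between object points and hand keypoints: (P, J, 3).
    diff = object_points[:, None, :] - hand_points[None, :, :]
    # Distance from each object point to its nearest hand keypoint: (P,).
    nearest = np.linalg.norm(diff, axis=-1).min(axis=1)
    # A value of 1 means touching; the value decays smoothly with distance.
    return np.exp(-nearest / sigma)


def bimanual_contact_map(object_points, left_hand, right_hand, sigma=0.02):
    """Stack the left and right contact maps into a (P, 2) array."""
    left = contact_map(object_points, left_hand, sigma)
    right = contact_map(object_points, right_hand, sigma)
    return np.stack([left, right], axis=-1)
```

Keeping one channel per hand preserves which hand touches which region, which is the kind of bimanual pattern the abstract refers to.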

Paper Structure

This paper contains 24 sections, 12 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Overview of the proposed approach. BimArt takes $N$ frames of object trajectories as input and generates $N$ frames of 3D bimanual interactions. The object features (articulation-aware BPS features $\mathbf{O}$, 6D global states $\mathbf{G}$, and the object scale $s_{\mathrm{o}}$) are passed into both the object encoder $\mathcal{E}_o$ (MLP) in the contact generation model and $\mathcal{E}_{\alpha}$ (MLP) in the motion generation model. Additionally, the motion generation model's contact encoder $\mathcal{E}_c$ takes $\mathbf{C}$, the bimanual contact map produced by the contact generation model, as conditioning input. The contact model and motion model are both denoising diffusion models, and the spiral denotes the denoising process. $\mathbf{C}$ is further used as guidance at each diffusion timestep to align hand motions with the generated contact maps. Finally, we use optimization to correct contact and penetration artifacts and obtain 3D bimanual meshes.
  • Figure 2: Hand Representation. We parameterize the hand pose in each frame with $\mathtt{J}$ keypoints (in orange) sampled from the surface of the hand. In addition to its position, each keypoint carries the direction vector (dark blue lines) from the keypoint to the nearest object surface as a feature.
  • Figure 3: Different BPS Sampling Strategies. Top left: $\mathtt{K} \times 2$ basis points sampled uniformly within a 0.5-meter radius for unnormalized objects. Top middle: $\mathtt{K} \times 2$ basis points sampled uniformly in a unit ball for normalized objects. Top right: $\mathtt{K}$ basis points sampled uniformly in a unit ball for normalized objects, with the points mapped to each articulated part of the object, maintaining the same feature dimension. Bottom: Green points on the object represent the projections of the BPS feature vectors. The proposed Normalized Part BPS provides a denser mapping on the object's inner surface layer.
  • Figure 4: Qualitative Comparison. MDM-B struggles to establish accurate contact, as seen in the hand-object gap in the scissors and box examples. OMOMO-B's rigid contact constraints make it prone to failure, especially with large wrist movements such as opening a box. CAMS-B fails to generate plausible motions, since its stage-wise contact targets under-constrain MANO fitting in dynamic settings with complex contact patterns and diverse object trajectories.
  • Figure 5: Diverse Results. We show diverse bimanual sequences together with the predicted contact maps on the laptop, ketchup, and mixer given the same unseen trajectory per object. Our method generates accurate finger placements guided by the predicted contact maps.
  • ...and 4 more figures
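
The Normalized Part BPS strategy described in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the object is assumed to be given as one point cloud per articulated part, normalization is assumed to center the object and scale it into the unit ball, and each basis point stores the offset vector to its nearest point on each part. The helper names `sample_unit_ball`, `normalize_parts`, and `part_bps` are hypothetical.

```python
import numpy as np


def sample_unit_ball(k, rng):
    """Sample k basis points uniformly inside the unit ball."""
    directions = rng.normal(size=(k, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # The cube root of a uniform variable gives uniform density in volume.
    radii = rng.random(k) ** (1.0 / 3.0)
    return directions * radii[:, None]


def normalize_parts(parts):
    """Center the whole object and scale it to fit the unit ball.

    Returns the normalized parts and the scale factor. Treating this factor
    as the object scale is an assumption of this sketch.
    """
    all_points = np.concatenate(parts, axis=0)
    center = all_points.mean(axis=0)
    scale = np.linalg.norm(all_points - center, axis=1).max()
    return [(p - center) / scale for p in parts], scale


def part_bps(parts, basis):
    """Offset from every basis point to its nearest point on each part.

    parts: list of (P_i, 3) arrays, one per articulated part.
    basis: (K, 3) basis points shared by all parts.
    Returns an array of shape (num_parts, K, 3).
    """
    features = []
    for points in parts:
        diff = points[None, :, :] - basis[:, None, :]          # (K, P_i, 3)
        nearest = np.linalg.norm(diff, axis=-1).argmin(axis=1)  # (K,)
        features.append(diff[np.arange(len(basis)), nearest])   # (K, 3)
    return np.stack(features, axis=0)


# Usage with two synthetic parts standing in for an articulated object.
rng = np.random.default_rng(0)
basis = sample_unit_ball(64, rng)
parts = [rng.normal(size=(50, 3)), rng.normal(size=(30, 3)) + 2.0]
normalized_parts, scale = normalize_parts(parts)
features = part_bps(normalized_parts, basis)
```

With two parts, $\mathtt{K}$ shared basis points yield $\mathtt{K} \times 2$ feature vectors, which matches the feature dimension of the unsplit $\mathtt{K} \times 2$ variants in the caption. Adding a feature vector to its basis point recovers the projected surface point, which corresponds to the green points in the bottom row of the figure.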