
THOM: Generating Physically Plausible Hand-Object Meshes From Text

Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim

Abstract

The generation of 3D hand-object interactions (HOIs) from text is crucial for dexterous robotic grasping and VR/AR content generation, requiring both high visual fidelity and physical plausibility. However, two difficulties remain: mesh extraction from text-generated Gaussians is an ill-posed problem, and physics-based optimization on the resulting erroneous meshes is unreliable. To address these issues, we introduce THOM, a training-free framework that generates photorealistic, physically plausible 3D HOI meshes without the need for a template object mesh. THOM employs a two-stage pipeline, initially generating the hand and object Gaussians, followed by physics-based HOI optimization. Our new mesh extraction method and vertex-to-Gaussian mapping explicitly assign Gaussian elements to mesh vertices, allowing topology-aware regularization. Furthermore, we improve the physical plausibility of interactions through VLM-guided translation refinement and contact-aware optimization. Comprehensive experiments demonstrate that THOM consistently surpasses state-of-the-art methods in terms of text alignment, visual realism, and interaction plausibility.
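The abstract mentions topology-aware (Laplacian) regularization enabled by the vertex-to-Gaussian mapping. The paper's exact loss is not reproduced here; the following is a minimal sketch of the standard uniform Laplacian regularizer commonly used for topological consistency on meshes, with all names chosen for illustration:

```python
# Illustrative sketch only: a uniform Laplacian regularizer of the kind
# typically used for topological consistency. THOM's actual formulation
# may differ; `laplacian_loss`, `vertices`, and `edges` are placeholders.
import numpy as np

def laplacian_loss(vertices, edges):
    """Mean squared distance between each vertex and the centroid of its
    1-ring neighbors (uniform graph Laplacian over an undirected edge list)."""
    n = vertices.shape[0]
    neighbor_sum = np.zeros_like(vertices)
    degree = np.zeros(n)
    for i, j in edges:
        neighbor_sum[i] += vertices[j]
        neighbor_sum[j] += vertices[i]
        degree[i] += 1
        degree[j] += 1
    # Centroid of each vertex's neighbors; isolated vertices contribute zero.
    centroid = neighbor_sum / np.maximum(degree, 1)[:, None]
    return float(np.mean(np.sum((vertices - centroid) ** 2, axis=1)))
```

Because the vertex-to-Gaussian mapping ties Gaussian elements to mesh vertices, a penalty of this form can be backpropagated to the Gaussians themselves, discouraging topology-breaking displacements during optimization.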

Paper Structure

This paper contains 48 sections, 13 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Overall pipeline of THOM. In the first stage, object and hand Gaussians are generated separately from the text prompts. In the second stage, we jointly refine the HOI Gaussians and HOI parameters. At HOI initialization, we refine the hand translation with VLM-guided refinement. During HOI parameter optimization, we introduce physics-based optimization via distance-adaptive contact losses and a reposition loss. In both stages, Laplacian regularization is applied for topological consistency.
  • Figure 2: Qualitative comparisons of ProlificDreamer [wang2023prolificdreamer], GaussianDreamerPro [yi2024gaussiandreamerpro], Hash3D [yang2025hash3d], and ours.
  • Figure 3: Qualitative comparison with human-object interaction methods. *: Adapted to generate hand-object interactions.
  • Figure 4: VLM refinement results. Left: "A right hand calling a smartphone". Middle: "A right hand using a hammer". Right: "A right hand inspecting Paddington Bear".
  • Figure 5: User preference study for THOM.
  • ...and 6 more figures
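Figure 1 refers to distance-adaptive contact losses for physics-based HOI optimization. As a rough illustration of the general idea (not the paper's definition), a contact loss attracts designated hand contact points toward the nearest object surface points, penalizing residual gaps beyond a contact threshold; all names and the threshold value below are assumptions:

```python
# Hedged sketch of a generic attraction-style contact loss; THOM's
# distance-adaptive weighting is not specified here. `hand_pts` and
# `obj_pts` are (N, 3) and (M, 3) point arrays; names are placeholders.
import numpy as np

def contact_loss(hand_pts, obj_pts, contact_thresh=0.005):
    """Squared penetration-free gap between each hand contact candidate
    and its nearest object point, averaged over hand points."""
    # Pairwise distances (N, M), then nearest object point per hand point.
    d = np.linalg.norm(hand_pts[:, None, :] - obj_pts[None, :, :], axis=-1)
    nearest = d.min(axis=1)
    # Only distances beyond the contact threshold are penalized.
    gap = np.clip(nearest - contact_thresh, 0.0, None)
    return float(np.mean(gap ** 2))
```

Minimizing such a term pulls the hand into contact, while a complementary repulsion or penetration term (not sketched) would keep the fingers from passing through the object surface.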