OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model

Zhenhao Zhang, Ye Shi, Lingxiao Yang, Suting Ni, Qi Ye, Jingya Wang

TL;DR

OpenHOI tackles open-world hand-object interaction synthesis by marrying a 3D multimodal LLM trained for affordance grounding and instruction decomposition with an affordance-driven diffusion generator and a training-free physics refinement. It enables long-horizon HOI sequences for unseen objects under open-vocabulary instructions, addressing generalization gaps in prior closed-set methods. Key innovations include a <AFF> affordance token, coarse-to-fine grounding, and a refinement-sampling loop that enforces physical plausibility and temporal coherence without distribution shift. Extensive experiments on GRAB and ARCTIC demonstrate state-of-the-art generalization to novel objects and complex language prompts, with ablations validating each component's contribution.

Abstract

Understanding and synthesizing realistic 3D hand-object interactions (HOI) is critical for applications ranging from immersive AR/VR to dexterous robotics. Existing methods struggle with generalization, performing well on closed-set objects and predefined tasks but failing to handle unseen objects or open-vocabulary instructions. We introduce OpenHOI, the first framework for open-world HOI synthesis, capable of generating long-horizon manipulation sequences for novel objects guided by free-form language commands. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition, enabling precise localization of interaction regions (e.g., handles, buttons) and breakdown of complex instructions (e.g., "Find a water bottle and take a sip") into executable sub-tasks. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage that minimizes penetration and optimizes affordance alignment. Evaluations across diverse scenarios demonstrate OpenHOI's superiority over state-of-the-art methods in generalizing to novel object categories, multi-stage tasks, and complex language instructions. Our project page is at https://openhoi.github.io.

Paper Structure

This paper contains 60 sections, 27 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Motivation: OpenHOI introduces an open-world framework for generating HOI sequences that demonstrates strong generalization across seen and unseen objects, high-level instructions, and long-horizon tasks.
  • Figure 2: Pipeline: Our framework comprises two sequential components. First, a 3D multimodal large language model (3D MLLM) ingests high-level instructions and object point clouds to generate sequential affordance maps and decompose the high-level task into a sequence of sub-tasks. Second, the diffusion model takes the affordance map and the decomposed task sequence as conditions to synthesize realistic hand-object interaction sequences.
  • Figure 3: Qualitative results: The visualizations show three types of long-horizon sequences: seen-object, unseen-object, and multi-object. They indicate that our method generalizes well to both unseen objects and open-vocabulary instructions, enabling open-world HOI sequence synthesis.
  • Figure A1: Visualization on Affordance
  • Figure A2: Qualitative results on seen object
  • ...and 1 more figure
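
The two-stage data flow described in the Figure 2 caption can be sketched roughly as follows. This is only an illustration of how the pieces connect, not the authors' implementation: the names `SubTask`, `HOISegment`, `run_pipeline`, `toy_mllm`, and `toy_generator` are hypothetical, and the two toy functions are trivial stand-ins for the 3D MLLM and the diffusion model (the physics refinement stage is omitted).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class SubTask:
    """One decomposed sub-task with its per-point affordance map (hypothetical type)."""
    text: str
    affordance: List[float]  # one score per object point


@dataclass
class HOISegment:
    """Generated motion for one sub-task (hypothetical type)."""
    sub_task: str
    frames: List[List[float]]  # one pose vector per frame


def run_pipeline(
    instruction: str,
    points: Sequence[Point],
    mllm: Callable[[str, Sequence[Point]], List[SubTask]],
    generator: Callable[[SubTask, Sequence[Point]], List[List[float]]],
) -> List[HOISegment]:
    """Stage 1: decompose and ground. Stage 2: generate one segment per sub-task."""
    sub_tasks = mllm(instruction, points)
    for st in sub_tasks:
        # Each affordance map must assign a score to every object point.
        if len(st.affordance) != len(points):
            raise ValueError("affordance map size must match the point cloud")
    return [HOISegment(st.text, generator(st, points)) for st in sub_tasks]


def toy_mllm(instruction: str, points: Sequence[Point]) -> List[SubTask]:
    """Stand-in for the 3D MLLM: splits on ' and ', returns uniform affordance."""
    parts = [p.strip() for p in instruction.split(" and ") if p.strip()]
    uniform = [1.0 / len(points)] * len(points)
    return [SubTask(p, list(uniform)) for p in parts]


def toy_generator(sub_task: SubTask, points: Sequence[Point]) -> List[List[float]]:
    """Stand-in for the diffusion model: returns 4 frames of zero poses."""
    return [[0.0, 0.0, 0.0] for _ in range(4)]


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
segments = run_pipeline("Find a water bottle and take a sip", cloud, toy_mllm, toy_generator)
for seg in segments:
    print(seg.sub_task, len(seg.frames))
```

Passing the two stages in as callables keeps the sketch independent of any particular model; a real system would replace the toy functions with the trained components.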