HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models

Mingzhen Huang, Fu-Jen Chu, Bugra Tekin, Kevin J Liang, Haoyu Ma, Weiyao Wang, Xingyu Chen, Pierre Gleize, Hongfei Xue, Siwei Lyu, Kris Kitani, Matt Feiszli, Hao Tang

TL;DR

HOIGPT addresses the challenge of learning and generating long 3D hand-object interaction sequences conditioned on language. It unifies HOI motion and text via an HOI tokenizer and an HOI-decomposed VQ-VAE coupled to a motion-aware LLM that can serialize and deserialize HOI motion as language tokens. The approach introduces geometry-aware training losses and a factorized codebook design to ensure physically plausible, long-range HOI sequences. Empirical results on ARCTIC and GRAB show state-of-the-art performance on HOI generation and text-to-HOI / HOI-to-text tasks, highlighting the potential for robotics, AR/VR, and human-computer interaction.

Abstract

We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.

Paper Structure

This paper contains 16 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: HOIGPT can interpret a variety of input prompts for diverse HOI-related tasks. We illustrate examples of text to HOI generation, HOI completion conditioned on object movement, and HOI captioning. HOIGPT generates or interprets hand-object interaction sequences in response to user queries, showcasing its capability to understand and produce contextually relevant HOI motions. The sequences represent time order from left to right.
  • Figure 2: Overview of the HOIGPT framework for bi-directional hand-object interaction (HOI) generation and understanding. The input sequence (left) includes both text and HOI sequences, processed by the text tokenizer and HOI encoder, respectively. The HOI encoder uses an HOI Tokenizer to decompose HOI sequences into object, left hand, and right hand tokens. The language model takes both text and HOI tokens to generate the output sequence, which includes both text descriptions and generated HOI sequences. This design enables seamless integration of text and HOI data for tasks like motion prediction, description, and completion.
  • Figure 3: Overview of HOI-decomposed VQ-VAE. Our framework processes hand and object features through dedicated hand and object encoders, which generate encoded representations. These representations are quantized using separate hand and object codebooks, resulting in corresponding codebook indices for each modality. The quantized indices are combined to form the HOI latent code, which is then decoded through object and hand decoders to reconstruct the HOI sequence. The reconstructed sequence captures realistic hand-object interactions that align closely with the input features. To further enhance physical plausibility, a geometric loss is applied, minimizing interpenetration between the hand and object and ensuring consistent, plausible contact dynamics.
  • Figure 4: Text to HOI generation examples. HOIGPT generates long HOI sequences containing multiple complex actions from text input alone.
  • Figure 5: Qualitative results of HOIGPT for HOI completion. HOIGPT is designed for multiple tasks, including HOI interpolation (top) and HOI prediction (bottom). The orange line indicates the input HOI sequence.
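
The quantization step described in the Figure 3 caption (separate hand and object codebooks producing per-modality indices) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `nearest_code` and `quantize_hoi`, the 2-D toy features, and the tiny codebooks are all hypothetical, and the sketch assumes plain nearest-neighbor lookup under squared L2 distance. The encoders, decoders, and geometric loss are omitted.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def quantize_hoi(
    hand_feats: Sequence[Vector],
    object_feats: Sequence[Vector],
    hand_codebook: Sequence[Vector],
    object_codebook: Sequence[Vector],
) -> List[Tuple[int, int]]:
    """Quantize per-frame hand and object features with their own codebooks.

    Returns one (hand_index, object_index) pair per frame. Pairing the two
    indices is an assumed stand-in for the combined HOI latent code.
    """
    if len(hand_feats) != len(object_feats):
        raise ValueError("hand and object sequences must have the same length")
    return [
        (nearest_code(h, hand_codebook), nearest_code(o, object_codebook))
        for h, o in zip(hand_feats, object_feats)
    ]


# Toy data (hypothetical): 2 frames, 2-D encoded features per modality.
hand_codebook = [[0.0, 0.0], [1.0, 1.0]]
object_codebook = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
hand_feats = [[0.1, -0.2], [0.9, 1.2]]
object_feats = [[1.8, 0.1], [0.2, 1.7]]

tokens = quantize_hoi(hand_feats, object_feats, hand_codebook, object_codebook)
print(tokens)  # [(0, 1), (1, 2)]
```

Keeping the codebooks separate means a hand index and an object index are chosen independently for each frame, so the same hand code can pair with different object codes. This is one plausible reading of the decomposed design in the caption, not a confirmed detail of the paper.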