HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models

Mingzhen Huang; Fu-Jen Chu; Bugra Tekin; Kevin J Liang; Haoyu Ma; Weiyao Wang; Xingyu Chen; Pierre Gleize; Hongfei Xue; Siwei Lyu; Kris Kitani; Matt Feiszli; Hao Tang

TL;DR

HOIGPT addresses the challenge of learning and generating long 3D hand-object interaction (HOI) sequences conditioned on language. It unifies HOI motion and text by coupling an HOI tokenizer, a hand-object decomposed VQ-VAE, with a motion-aware LLM that can serialize and deserialize HOI motion as language tokens. The approach introduces geometry-aware training losses and a factorized codebook design to encourage physically plausible, long-range HOI sequences. Empirical results on ARCTIC and GRAB show state-of-the-art performance on HOI generation and on text-to-HOI and HOI-to-text tasks, highlighting the potential for robotics, AR/VR, and human-computer interaction.
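To make the idea of serializing HOI motion as language tokens concrete, the following is a minimal sketch and not the authors' implementation. It assumes each time step has already been reduced to one hand code index and one object code index, and that these are written as special tokens of the form `<hand_i>` and `<obj_j>`. The token format and the function names `serialize` and `deserialize` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical token format, assumed for illustration only; the paper's
# actual vocabulary layout is not specified in this summary.
TOKEN_RE = re.compile(r"<(hand|obj)_(\d+)>")


def serialize(hand_ids, obj_ids):
    """Interleave per-step hand and object code indices into one token string."""
    if len(hand_ids) != len(obj_ids):
        raise ValueError("hand and object streams must have equal length")
    parts = []
    for h, o in zip(hand_ids, obj_ids):
        parts.append(f"<hand_{h}>")
        parts.append(f"<obj_{o}>")
    return " ".join(parts)


def deserialize(text):
    """Recover the hand and object index streams from a token string."""
    hand_ids, obj_ids = [], []
    for kind, idx in TOKEN_RE.findall(text):
        (hand_ids if kind == "hand" else obj_ids).append(int(idx))
    return hand_ids, obj_ids


if __name__ == "__main__":
    tokens = serialize([3, 7], [1, 4])
    print(tokens)
    print(deserialize(tokens))
```

Under this assumed format, a language model can treat the motion tokens like any other vocabulary entries, and the round trip through `deserialize` returns the index streams that a decoder would map back to motion.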

Abstract

We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.
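The discretization step of a VQ-VAE can be illustrated with a short sketch. This shows generic nearest-neighbor vector quantization, not the paper's tokenizer. It assumes that "hand-object decomposed" means hand and object latents are quantized against separate codebooks, which is an interpretation of the abstract rather than a confirmed detail. The function names and toy codebooks are hypothetical, and the encoder, decoder, and training losses are omitted.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    return [nearest_code(v, codebook) for v in latents]


def tokenize_hoi(hand_latents, obj_latents, hand_codebook, obj_codebook):
    """Quantize hand and object latents separately (assumed decomposition)."""
    return quantize(hand_latents, hand_codebook), quantize(obj_latents, obj_codebook)


if __name__ == "__main__":
    # Toy 2-D codebooks; real codebooks would be learned and much larger.
    hand_codebook = [[0.0, 0.0], [1.0, 1.0]]
    obj_codebook = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
    hand_latents = [[0.1, 0.2], [0.9, 0.8]]
    obj_latents = [[1.9, 0.1], [0.2, 1.7]]
    print(tokenize_hoi(hand_latents, obj_latents, hand_codebook, obj_codebook))
```

The resulting index lists are the discrete HOI tokens that a language model could consume alongside text tokens.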
