CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild

Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies

TL;DR

This paper introduces 3D Hands in the Wild (3D-HIW), a dataset of 32K 3D hand-motion sequences with aligned text, and CLUTCH, an LLM-based hand animation system with two critical innovations: SHIFT, a novel VQ-VAE architecture that tokenizes hand motion, and a geometric refinement stage that finetunes the LLM.

Abstract

Hands play a central role in daily life, yet modelling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to achieve both animation fidelity and text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture in-the-wild motion, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.
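The geometric refinement stage described above pairs the usual token-level objective with a reconstruction loss on decoded motion parameters. The following is a minimal sketch of such a co-supervised loss, not the authors' implementation. The function names, the `recon_weight` parameter, and in particular the probability-weighted "soft decode" standing in for the VQ-VAE decoder are all assumptions made for illustration; the abstract does not specify how the decoded parameters are obtained from the LLM output.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def token_cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the ground-truth motion token."""
    return -math.log(probs[target])


def soft_decode(probs: Sequence[float], codebook: Sequence[Vector]) -> List[float]:
    """Hypothetical decoder: probability-weighted average of codebook entries.

    A real system would pass codes through the VQ-VAE decoder; this stand-in
    only keeps the example self-contained.
    """
    dim = len(codebook[0])
    return [sum(p * code[d] for p, code in zip(probs, codebook)) for d in range(dim)]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two parameter vectors."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def co_supervised_loss(
    step_probs: Sequence[Sequence[float]],  # predicted token distribution per step
    target_tokens: Sequence[int],           # ground-truth motion token per step
    target_params: Sequence[Vector],        # ground-truth motion parameters per step
    codebook: Sequence[Vector],
    recon_weight: float = 1.0,              # assumed weighting, not from the paper
) -> float:
    """Token cross-entropy plus a weighted reconstruction term on decoded motion."""
    n = len(step_probs)
    ce = sum(token_cross_entropy(p, t) for p, t in zip(step_probs, target_tokens)) / n
    recon = sum(
        mse(soft_decode(p, codebook), y) for p, y in zip(step_probs, target_params)
    ) / n
    return ce + recon_weight * recon
```

The point of the second term is that two token predictions with similar likelihood can decode to very different hand poses; penalizing the decoded parameters directly ties the language-model objective to the geometry it is meant to produce.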

Paper Structure (37 sections, 9 equations, 16 figures, 14 tables)

Figures (16)

  • Figure 1: CLUTCH is a novel LLM-based model that enables text-conditioned synthesis (left) and captioning of in-the-wild 3D hand motions (right).
  • Figure 2: Overview: CLUTCH is an LLM for synthesizing and captioning in-the-wild 3D hand motions. To train this model, we (i) generate an in-the-wild hand motion dataset (\ref{sec:data_annotation}), (ii) tokenize the hand motion using a novel decomposed VQ-VAE tokenizer (\ref{sec:method_hand_motion_vqvae}), and (iii) train the LLM to model both text and motion in a unified token space (\ref{sec:method_llm}).
  • Figure 3: Data annotation pipeline: We generate motion–text pairs from egocentric videos using a novel automated annotation framework combined with a state-of-the-art hand tracker. Text annotations are produced by first applying Parallel Chain-of-Thought prompting for open-vocabulary reasoning, followed by a closed-vocabulary refinement stage.
  • Figure 4: Example of the two-stage annotation pipeline for an egocentric video (\ref{fig:dataset_samples}).
  • Figure 4: Comparison of VQ-VAE configurations.
  • ...and 11 more figures
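
Steps (ii) and (iii) of the Figure 2 overview, tokenizing motion and placing it in a token space shared with text, can be illustrated with a minimal sketch. It assumes a single small codebook and the common convention of offsetting motion codes past the text vocabulary; the function names and `text_vocab_size` are hypothetical. SHIFT itself is decomposed by part and modality, which this simplified example does not attempt to reproduce.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(z: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to z under squared L2 distance."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(z, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def motion_to_token_ids(
    latents: Sequence[Vector],
    codebook: Sequence[Vector],
    text_vocab_size: int,  # assumed size of the LLM's text vocabulary
) -> List[int]:
    """Quantize each latent frame and shift its code index past the text vocabulary."""
    return [text_vocab_size + nearest_code(z, codebook) for z in latents]


def token_ids_to_codes(
    token_ids: Sequence[int],
    codebook: Sequence[Vector],
    text_vocab_size: int,
) -> List[List[float]]:
    """Map motion token ids back to their codebook vectors."""
    return [list(codebook[t - text_vocab_size]) for t in token_ids]
```

With this layout, a text-to-motion sample is a single id sequence in which ids below `text_vocab_size` are words and ids at or above it are motion codes, so one autoregressive model can read and write both.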