TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, Jingbo Wang

TL;DR

TokenHSI presents a unified, transformer-based controller for diverse HSI tasks by introducing a dedicated proprioception tokenizer and masking-based task integration, enabling multi-skill learning within a single model. The approach supports flexible input lengths and rapid policy adaptation through lightweight adapters and new task tokenizers, facilitating skill composition, object/terrain variation, and long-horizon task completion. Empirical results show TokenHSI achieves higher success rates and better generalization than specialist policies and other adaptation methods, while maintaining efficiency and extensibility across a range of tasks. This work advances practical physics-based humanoid control by providing a scalable, adaptable framework for unified HSI synthesis in dynamic environments.

Abstract

Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: https://liangpan99.github.io/TokenHSI/
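
The masking idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it assumes a single attention step in which the shared proprioception token acts as the query, and the names `attend` and `fuse_tokens` are hypothetical. It only shows how a boolean mask lets one shared token be combined with any number of task tokens, so that inactive tasks contribute nothing and new task tokens can be appended without changing the function.

```python
import math


def attend(query, keys, values, active):
    """Single-head scaled dot-product attention with a key mask.

    active[i] is False for tokens that must be ignored (assumed masking scheme).
    At least one token must be active.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) if on else -math.inf
        for key, on in zip(keys, active)
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # masked tokens get weight 0.0
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def fuse_tokens(proprio_token, task_tokens, task_mask):
    """Combine the shared proprioception token with the active task tokens.

    The proprioception token is always active; task_mask selects which
    task tokens take part (hypothetical interface for illustration).
    """
    tokens = [proprio_token] + list(task_tokens)
    active = [True] + list(task_mask)
    return attend(proprio_token, tokens, tokens, active)


proprio = [1.0, 0.0]
# Two task tokens, only the first is active.
fused = fuse_tokens(proprio, [[0.0, 1.0], [5.0, 5.0]], [True, False])
# Appending a third task token changes the input length, not the output size.
fused_longer = fuse_tokens(
    proprio, [[0.0, 1.0], [5.0, 5.0], [2.0, 2.0]], [True, False, True]
)
print(len(fused), len(fused_longer))
```

Because masked tokens receive zero attention weight, their contents do not affect the fused feature, and the output dimension stays fixed however many task tokens are supplied. That is the property the abstract relies on when it describes variable-length inputs and additional task tokenizers.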

Paper Structure

This paper contains 35 sections, 20 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Introducing TokenHSI, a unified model that enables physics-based characters to perform diverse human-scene interaction tasks. It excels at seamlessly unifying multiple foundational HSI skills within a single transformer network and flexibly adapting learned skills to challenging new tasks, including skill composition, object/terrain shape variation, and long-horizon task completion.
  • Figure 2: TokenHSI consists of two stages: (left) foundational skill learning and (right) policy adaptation. Through multi-task policy training, the proposed framework learns versatile interaction skills in a single transformer network. These learned skills can be flexibly adapted to more challenging HSI tasks by training lightweight modules, e.g., $\mathbb{T}^{new}$, $\mathbb{T}^{c}$, and $\xi^{\mathbb{A}}= \{ \xi^{\mathbb{A}}_0, \xi^{\mathbb{A}}_1 \}$.
  • Figure 3: Learning curves comparing the efficiency on skill composition tasks using TokenHSI, policies trained from scratch [peng2021amp], CML [xu2023composite], and its improved version CML (dual). Colored regions denote mean values $\pm$ one standard deviation over $3$ models initialized with different random seeds.
  • Figure 4: Through policy adaptation, TokenHSI can generalize learned foundational skills to more challenging scene interaction tasks.
  • Figure 5: Learning curves comparing the efficiency on object shape variation tasks using TokenHSI, full fine-tuning of pre-trained policies, and AdaptNet [xu2023adaptnet].
  • ...and 5 more figures