GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts
Zoltán Á. Milacski, Koichiro Niinuma, Ryosuke Kawamura, Fernando de la Torre, László A. Jeni
TL;DR
This work addresses the challenge of grounding text and 3D scene context in human motion generation by replacing a closed vocabulary scene encoder with an open vocabulary grounding approach. It introduces GHOST, a two-step framework that first distills knowledge from open vocabulary segmentation to align 3D scene representations with text in CLIP space, then fine-tunes the scene encoder with grounding regularizers that emphasize the goal object's category and size. On the HUMANISE benchmark, the method reduces goal-object distance by up to 30% relative to the prior state-of-the-art baseline, with implementations built on three teacher models and favorable judgments in a perceptual study. The results highlight the practical potential for accurate, text-driven motion generation in diverse environments, while also suggesting avenues for improvement through diffusion models and broader grounding targets.
Abstract
The connection between our 3D surroundings and the descriptive language that characterizes them would be well suited for localizing and generating human motion in context, but for one problem: the complexity introduced by multiple modalities makes this connection challenging to capture with a fixed set of descriptors. Specifically, closed vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently, the scene encoder is fine-tuned for conditional motion generation, incorporating two novel regularization losses that regress the category and size of the goal object. Our method achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model on the HUMANISE dataset. This improvement is demonstrated through evaluations of three implementations of our framework and a perceptual study. Additionally, our method is designed to accommodate future 2D segmentation methods that provide per-pixel text-aligned features for distillation.
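As a rough illustration of the two training steps described above, the following minimal sketch shows one way the loss terms could be composed. It is not the authors' implementation: the cosine form of the distillation loss, the use of cross-entropy for the category term, the mean squared error for the size term, and all function names and weights (`w_cat`, `w_size`) are assumptions made for illustration only.

```python
import math

def cosine_distill_loss(student, teacher):
    """Step 1 (assumed form): mean (1 - cosine similarity) between
    student scene features and teacher text-aligned features."""
    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / norm
    return total / len(student)

def category_loss(logits, target_index):
    """Step 2 regularizer (assumed form): softmax cross-entropy
    on the goal object's category."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]

def size_loss(pred_size, true_size):
    """Step 2 regularizer (assumed form): mean squared error
    on the goal object's extents."""
    return sum((p - q) ** 2 for p, q in zip(pred_size, true_size)) / len(true_size)

def finetune_loss(motion_loss, logits, target_index, pred_size, true_size,
                  w_cat=1.0, w_size=1.0):
    """Hypothetical fine-tuning objective: motion generation loss
    plus the two weighted grounding regularizers."""
    return (motion_loss
            + w_cat * category_loss(logits, target_index)
            + w_size * size_loss(pred_size, true_size))
```

In this sketch, `cosine_distill_loss` would be minimized during pretraining, and `finetune_loss` during the subsequent conditional motion generation stage, with `motion_loss` supplied by whatever generative objective the motion model uses.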
