
GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts

Zoltán Á. Milacski, Koichiro Niinuma, Ryosuke Kawamura, Fernando de la Torre, László A. Jeni

TL;DR

This work addresses the challenge of grounding text and 3D scene context in human motion generation by replacing a closed vocabulary scene encoder with an open vocabulary grounding approach. It introduces GHOST, a two-step framework that first distills knowledge from open vocabulary segmentation to align 3D scene representations with text in CLIP space, then fine-tunes the scene encoder with grounding regularizers that emphasize the goal object's category and size. The method yields substantial improvements on the HUMANISE benchmark, achieving up to 30% reduction in goal-object distance and favorable perceptual judgments across three teacher models. The results highlight the practical potential for accurate, text-driven motion generation in diverse environments, while also suggesting avenues for improvement through diffusion models and broader grounding targets.

Abstract

The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context but for one problem. The complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently, the scene encoder is fine-tuned for conditional motion generation, incorporating two novel regularization losses that regress the category and size of the goal object. Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model on the HUMANISE dataset. This improvement is demonstrated through evaluations conducted using three implementations of our framework and a perceptual study. Additionally, our method is designed to seamlessly accommodate future 2D segmentation methods that provide per-pixel text-aligned features for distillation.
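The pretraining step described above can be illustrated with a minimal sketch. It assumes that each 3D point has already been matched to one teacher pixel feature, and that the distillation objective takes the form of one minus the mean cosine similarity; the function names `cosine_similarity` and `distillation_loss` are hypothetical and are not taken from the paper or its code.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def distillation_loss(point_features, teacher_features):
    """One minus the mean cosine similarity over matched feature pairs.

    point_features:   per-point features from the 3D scene encoder (student).
    teacher_features: text-aligned 2D pixel features from the teacher,
                      already matched one-to-one with the points (assumed).
    """
    if len(point_features) != len(teacher_features):
        raise ValueError("student and teacher features must be matched one-to-one")
    if not point_features:
        raise ValueError("at least one feature pair is required")
    sims = [cosine_similarity(p, t) for p, t in zip(point_features, teacher_features)]
    return 1.0 - sum(sims) / len(sims)


# Toy example: the first pair is aligned, the second pair is orthogonal.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[3.0, 0.0], [1.0, 0.0]]
loss = distillation_loss(student, teacher)
```

Minimizing this quantity is equivalent to maximizing the average cosine similarity, so the loss only depends on feature direction, not magnitude. That matches how similarity is usually measured in CLIP-style embedding spaces, although the exact objective used in the paper may differ.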

Paper Structure (29 sections, 3 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Comparison of our proposed GHOST cVAE method with the prior state-of-the-art HUMANISE cVAE (Wang et al., 2022) in text-and-scene-conditional human motion generation. Best viewed in color. (a) The HUMANISE cVAE exhibits a bias towards generating motions centered within the scene. (b) In contrast, our GHOST cVAE demonstrates superior semantic understanding and achieves higher action performance. (c) The three implementations of our GHOST framework have approximately $1.5\times$ to $3.9\times$ larger parameter counts (indicated by dot radii) than the HUMANISE cVAE. All three of our variants outperform the baseline in two text-scene grounding metrics.
  • Figure 2: Overview of our idea. Best viewed in color. We compare our GHOST cVAE with the HUMANISE cVAE (Wang et al., 2022). The major differences are in the text and 3D scene point cloud representations, the grounding, and the regularization. (a) The HUMANISE cVAE architecture uses a closed vocabulary scene encoder that produces a finite set of labels, resulting in a misalignment with the open vocabulary text feature space. This requires the fusion module to learn grounding from scratch. Grounding is regularized by regressing the center point of the goal object. (b) In contrast, our GHOST cVAE architecture employs a shared open vocabulary vision-language space for both modalities, establishing initial grounding between them. We regularize grounding by classifying the goal object and regressing its bounding box corners, increasing awareness of category and size.
  • Figure 3: Schematic diagram of the pretraining and training phases of our proposed GHOST framework for text-and-scene-conditional human motion generation. (a) Pretraining maximizes the cosine similarity between the features of our scene point cloud encoder and the corresponding text-aligned 2D viewpoint pixel features, computed by an open vocabulary image segmentation teacher model. This ensures that our features align with text embeddings in a shared space. We use a Point Transformer U-Net scene encoder. (b) Training employs a Conditional Variational Autoencoder (cVAE) architecture for motion generation, conditioned on both text and scene encoder outputs. The pretrained scene encoder weights are fine-tuned with two novel regularization losses (goal object bounding box regression and classification) to improve grounding. The remaining components of the model are unchanged from the original HUMANISE cVAE (Wang et al., 2022).
  • Figure 4: Qualitative generation results of the agnostic all-actions models on the HUMANISE dataset. We display 6 samples for each text, with 3 generated by each model. Ground truth goal objects are highlighted in red, and accompanying attention maps are depicted with purple camera frustums. Our GHOST model places the character significantly closer to the goal than the HUMANISE cVAE baseline.
  • Figure 5: Qualitative generation results of the ablation on the walk action subset of the HUMANISE dataset. We display 3 samples for the same text, with 1 generated by each model. The ground truth goal object is highlighted in red. With our proposed regularization losses, our GHOST model places the character significantly closer to the goal.
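
The two grounding regularizers named in the Figure 2(b) and Figure 3(b) captions (goal object classification and bounding box corner regression) can be sketched as follows. This is only an illustration under stated assumptions: the box is represented by its minimum and maximum corners, the regression term is a mean absolute error, and the weights `w_cls` and `w_box` are hypothetical. None of these choices is confirmed by the captions.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def corner_regression_loss(pred_corners, true_corners):
    """Mean absolute error over all coordinates of the box corners.

    Each argument is a pair of 3D corners, (min_corner, max_corner);
    this two-corner representation is an assumption of this sketch.
    """
    diffs = [
        abs(p - t)
        for pred, true in zip(pred_corners, true_corners)
        for p, t in zip(pred, true)
    ]
    return sum(diffs) / len(diffs)


def grounding_regularizer(logits, target_index, pred_corners, true_corners,
                          w_cls=1.0, w_box=1.0):
    """Weighted sum of the category and size terms (weights are hypothetical)."""
    return (w_cls * cross_entropy(logits, target_index)
            + w_box * corner_regression_loss(pred_corners, true_corners))


# Toy example: two candidate categories with uniform logits, and a predicted
# box whose max corner is off by 0.6 along the x axis.
true_box = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
pred_box = ([0.0, 0.0, 0.0], [1.6, 1.0, 1.0])
reg = grounding_regularizer([0.0, 0.0], 0, pred_box, true_box)
```

In a full training setup, a term like this would be added to the cVAE objective so that the scene encoder receives gradient signal about what the goal object is (category) and how large it is (corner positions), rather than only where its center lies.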