Controllable Long-term Motion Generation with Extended Joint Targets

Eunjong Lee, Eunhee Kim, Sanghoon Hong, Eunho Jung, Jihoon Kim

TL;DR

COMET tackles real-time, controllable long-horizon human motion generation by unifying a Transformer-based conditional VAE with an adaptive joint-control mechanism. A joint-wise attention scheme enables arbitrary subsets of joints to be controlled without retraining, while a reference-guided feedback loop grounds generation in a learned pose manifold to prevent drift. The approach also supports plug-and-play stylization by swapping style GMMs at inference. Empirical results show strong performance on single- and multi-joint control, long-horizon tasks, in-betweening, and stylization, outperforming state-of-the-art baselines and demonstrating real-time viability for interactive applications.

Abstract

Generating stable and controllable character motion in real time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.
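To make the control idea concrete, here is a minimal sketch of attending only to actively controlled joints inside an autoregressive delta rollout. It is an illustration under stated assumptions, not the paper's implementation: joints are reduced to scalars, and `masked_attention_weights`, `toy_delta`, `rollout`, and the `gain` parameter are hypothetical names, with `toy_delta` standing in for the learned c-VAE decoder.

```python
import math


def masked_attention_weights(scores, active):
    """Softmax over per-joint scores, restricted to actively controlled joints.

    Inactive joints get weight 0, so they contribute nothing to the
    aggregated intention feature.
    """
    if not any(active):
        return [0.0] * len(scores)
    peak = max(s for s, a in zip(scores, active) if a)  # for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, active)]
    total = sum(exps)
    return [e / total for e in exps]


def toy_delta(pose, targets, active, gain=0.5):
    """Hypothetical stand-in for the learned decoder (an assumption):
    step each controlled joint a fraction of the way to its target."""
    return [gain * (t - p) if a else 0.0 for p, t, a in zip(pose, targets, active)]


def rollout(pose, targets, active, n_steps):
    """Autoregressive update: the next pose is the current pose plus the predicted delta."""
    for _ in range(n_steps):
        delta = toy_delta(pose, targets, active)
        pose = [p + d for p, d in zip(pose, delta)]
    return pose


# Control joint 0 only; joint 1 is left free and therefore unchanged here.
weights = masked_attention_weights([1.0, 2.0, 3.0], [True, False, True])
final_pose = rollout([0.0, 0.0], [1.0, 1.0], [True, False], n_steps=10)
```

Because the mask is an input rather than part of the weights, the same function handles any subset of joints, which is the property the abstract attributes to a single model.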

Paper Structure

This paper contains 25 sections, 12 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Capabilities of the proposed COMET framework. A single model seamlessly handles both single-joint and multi-joint goal-reaching tasks while sustaining stable, realistic motion throughout long-horizon generation with arbitrary target sets.
  • Figure 2: Architecture of COMET. The model employs a conditional Variational Autoencoder (c-VAE) with Transformer encoder layers. It processes the current state feature ($\mathbf{p}_i$), the delta feature ($\mathbf{\delta}_i$), and the composite intention feature ($\mathbf{I}_i$). Transformer attention is applied to the intention features; this enables adaptive control by letting the model attend only to information from actively controlled joints, as indicated by the input controlling-joint signal. The system then autoregressively predicts the subsequent delta feature ($\hat{\mathbf{\delta}}_{i}$), which is the change to the current pose required to achieve the goal-reaching task.
  • Figure 3: Reference-guided feedback with GMM components as attractors. The left cluster represents the old style reference, the center cluster represents the drunken style reference, and the right cluster represents the cold style reference. The corrected pose, shown as the point moved by the arrow, is positioned relative to these references.
  • Figure 4: Qualitative results on multi-joint control.
  • Figure 5: Qualitative comparison of long-term sequential goal-reaching with and without the proposed Reference-Guided Feedback (RGF). (a) Without RGF, the motion rapidly diverges, exhibiting unnatural drifts and eventual collapse over extended horizons. (b) With RGF enabled, COMET maintains coherent trajectories while accurately reaching all sequential targets.
  • ...and 3 more figures
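
The reference-guided feedback in the Figure 3 caption, where GMM components act as attractors, can be sketched as follows. This is a simplified illustration and not the authors' formulation: it assumes isotropic Gaussians with a shared `sigma`, a fixed pull `strength`, and a low-dimensional pose vector; `responsibilities` and `reference_feedback` are hypothetical names.

```python
import math


def responsibilities(pose, means, weights, sigma):
    """Posterior probability of each GMM component given the pose
    (isotropic Gaussians with shared sigma, an assumption of this sketch)."""
    logs = [
        math.log(w) - sum((x - m) ** 2 for x, m in zip(pose, mu)) / (2 * sigma ** 2)
        for mu, w in zip(means, weights)
    ]
    peak = max(logs)  # subtract the max before exponentiating, for stability
    exps = [math.exp(v - peak) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def reference_feedback(pose, means, weights, sigma=1.0, strength=0.3):
    """Pull the pose toward the responsibility-weighted mean of the components."""
    resp = responsibilities(pose, means, weights, sigma)
    attractor = [
        sum(r * mu[d] for r, mu in zip(resp, means)) for d in range(len(pose))
    ]
    return [x + strength * (a - x) for x, a in zip(pose, attractor)]


# A drifting pose is corrected toward the nearest reference component.
style_a = ([[0.0, 0.0], [10.0, 10.0]], [0.5, 0.5])
corrected_a = reference_feedback([1.0, 1.0], *style_a)

# Swapping in a different (hypothetical) style GMM changes the attractor
# without touching the motion model.
style_b = ([[2.0, 2.0]], [1.0])
corrected_b = reference_feedback([1.0, 1.0], *style_b)
```

In this toy setting, swapping `style_a` for `style_b` mirrors the plug-and-play stylization described in the TL;DR: only the reference distribution changes at inference.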