TLControl: Trajectory and Language Control for Human Motion Synthesis

Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, Lingjie Liu

TL;DR

The paper tackles controllable human motion synthesis by jointly leveraging language descriptions $L$ and partial trajectories $\textbf{R'}$ to generate full-body motions $\textbf{J} \in \mathbb{R}^{T \times M}$. It introduces TLControl, a framework that combines a part-based VQ-VAE to learn a structured latent space, a Masked Trajectories Transformer (MTT) conditioned on $L$ and $\textbf{R'}$ to produce a coarse latent seed, and a test-time optimization in latent space to align the motion with the given trajectories. The contributions include a six-body-part grouping in the VQ-VAE, CLIP-based language conditioning in the MTT with trajectory masking, and a flexible latent-space optimization that enables precise and efficient trajectory alignment. Experiments on KIT-ML and HumanML3D demonstrate superior trajectory fidelity and faster runtimes compared with state-of-the-art methods, validating TLControl's practicality for interactive, high-quality animation.
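
To make the idea of a part-based quantized latent space more concrete, the following is a minimal sketch of per-part vector quantization: each body part has its own codebook, and each frame's latent is snapped to the nearest code. It is not the authors' implementation. The part names in `PARTS`, the function `quantize_parts`, and the array shapes are assumptions chosen for illustration; the summary only states that six body-part groups are used.

```python
import numpy as np

# Hypothetical names for the six body-part groups (the source does not list them).
PARTS = ["torso", "head", "left_arm", "right_arm", "left_leg", "right_leg"]


def quantize_parts(latents, codebooks):
    """Nearest-neighbour vector quantization, done independently per body part.

    latents:   dict mapping part name -> array of shape (T, D), one latent per frame
    codebooks: dict mapping part name -> array of shape (K, D), K codes per part
    Returns a dict mapping part name -> (indices of shape (T,), quantized (T, D)).
    """
    result = {}
    for part, z in latents.items():
        codebook = codebooks[part]
        # Squared Euclidean distance from every frame latent to every code: (T, K).
        dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        indices = dists.argmin(axis=1)
        result[part] = (indices, codebook[indices])
    return result
```

Keeping one codebook per part means that a code index for, say, an arm can change without affecting the codes chosen for the legs, which is one plausible reading of what a latent space "organized by body parts" provides.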

Abstract

Controllable human motion synthesis is essential for applications in AR/VR, gaming and embodied AI. Existing methods often focus solely on either language or full trajectory control, lacking precision in synthesizing motions aligned with user-specified trajectories, especially for multi-joint control. To address these issues, we present TLControl, a novel method for realistic human motion synthesis, incorporating both low-level Trajectory and high-level Language semantics controls, through the integration of neural-based and optimization-based techniques. Specifically, we begin with training a VQ-VAE for a compact and well-structured latent motion space organized by body parts. We then propose a Masked Trajectories Transformer (MTT) for predicting a motion distribution conditioned on language and trajectory. Once trained, we use MTT to sample initial motion predictions given user-specified partial trajectories and text descriptions as conditioning. Finally, we introduce a test-time optimization to refine these coarse predictions for precise trajectory control, which offers flexibility by allowing users to specify various optimization goals and ensures high runtime efficiency. Comprehensive experiments show that TLControl significantly outperforms the state-of-the-art in trajectory accuracy and time efficiency, making it practical for interactive and high-quality animation generation.
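
The test-time refinement described above can be illustrated with a toy sketch: starting from a coarse latent, gradient descent adjusts the latent so that the decoded motion matches the user-specified trajectory on the controlled entries only. Everything here is an assumption made for illustration, not the paper's method: the decoder is a fixed linear map `W` standing in for the frozen VQ-VAE decoder, the objective is a plain masked squared error, and the names `decode`, `masked_traj_loss`, and `refine_latent` are hypothetical.

```python
import numpy as np


def decode(z, W):
    """Toy stand-in for a frozen decoder: a linear map from latent to flattened motion."""
    return W @ z


def masked_traj_loss(z, W, target, mask):
    """Squared error between decoded motion and target, counted only where mask == 1."""
    residual = mask * (decode(z, W) - target)
    return float(residual @ residual)


def refine_latent(z0, W, target, mask, steps=200):
    """Gradient descent on the latent; the decoder W is never updated.

    mask is a 0/1 vector marking which motion entries the user constrained.
    """
    z = z0.copy()
    # Step size 1/L, where L = 2 * ||W||_2^2 upper-bounds the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    for _ in range(steps):
        residual = mask * (decode(z, W) - target)
        grad = 2.0 * W.T @ residual  # gradient of the masked squared error w.r.t. z
        z -= lr * grad
    return z
```

Because only the latent is optimized, different objectives could in principle be swapped in by replacing `masked_traj_loss` and its gradient, which mirrors the flexibility the abstract attributes to the test-time optimization.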

Paper Structure (27 sections, 4 equations, 15 figures, 8 tables)

Figures (15)

  • Figure 1: TLControl, a novel method for Trajectory and Language Control for Human Motion Synthesis. The corresponding control joints are highlighted in orange. Our method demonstrates versatile multi-joint controls (see Figures 1a to 1c), the ability to handle complex trajectories (see Figure 1d), multi-stage control (see Figure 1e), and the preservation of language semantics while utilizing trajectory controls (see Figure 1f). The dotted lines represent the input control trajectories defined by users through specifying the parameters of analytical shapes. Note that the input trajectories can also be hand drawings from users, or parameters from environment settings (see Fig. \ref{fig:qualitative_2}). We highly encourage readers to view our supplementary video to see our results.
  • Figure 1: Per-batch running time statistics of our embedding compared to the unsplit embedding. “Upper Body” includes the joints of the hands and the head joint. “Lower Body” includes the joints of the two feet and the pelvis joint.
  • Figure 2: Overview of the TLControl framework. In training stage I, we train the part-based VQ-VAE in \ref{subsec:vqvae} for reconstructing human motions. In training stage II, the decoder of the part-based VQ-VAE is frozen and we train the Masked Trajectories Transformer (MTT) in \ref{subsec:transformer} for predicting code indices from control inputs. Finally, at test time, the MTT receives a text description and partial control trajectories to predict an initial VQ-VAE quantized code seed, which is refined by run-time optimization as in \ref{subsec:Optimization} before decoding with the VQ-VAE into full-body motions.
  • Figure 2: Influence of different trajectory incompleteness. We simulate the incompleteness by applying random masking. The left vertical axis represents the FID metric, while the right vertical axis indicates the R-precision metric.
  • Figure 3: Qualitative results of TLControl using user-defined trajectories. Figure 3a and Figure 3d demonstrate that our method enables separate controls using language and joint-level trajectories. Figure 3b and Figure 3c showcase the capability of our method to manage multi-joint control simultaneously. Please refer to our supplementary material for more qualitative results.
  • ...and 10 more figures