
Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction

Guowei Xu, Jiale Tao, Wen Li, Lixin Duan

TL;DR

This work tackles stochastic human motion prediction by addressing the weak guidance of latent distributions in generative models. It introduces Semantic Latent Directions (SLD), an orthogonal latent basis that constrains the latent space and represents future motion as $z=\sum_{m=1}^{M} w_m d_m$, decoded by $\widehat{Y}=G_\phi(X,z)$, while diverse samples are produced via learnable motion queries projected into the SLD space. The method combines an encoder-decoder backbone with a Query-to-Latent Projection (QLP) and a DCT-based preprocessing pipeline, enabling accurate, diverse, and controllable predictions by editing latent coefficients. Extensive experiments on Human3.6M and HumanEva-I demonstrate state-of-the-art accuracy with competitive diversity, and ablations corroborate the benefit of projecting queries into the semantically structured latent space. The work offers a practical, lightweight pathway to semantically disentangled motion representations and controllable SHMP, with code and pretrained models released for reproducibility.
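The latent composition $z=\sum_{m=1}^{M} w_m d_m$ and the idea of control by editing coefficients can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: random orthonormal vectors built by Gram-Schmidt stand in for the learned directions $d_m$, the sizes (4 directions in an 8-dimensional latent space) are arbitrary, and the decoder $G_\phi$ is not modeled.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_directions(num_directions, dim, seed=0):
    """Build orthonormal vectors via Gram-Schmidt.

    Illustrative stand-in for learned semantic latent directions;
    requires num_directions <= dim.
    """
    rng = random.Random(seed)
    basis = []
    while len(basis) < num_directions:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for d in basis:
            proj = dot(v, d)
            v = [a - proj * b for a, b in zip(v, d)]  # remove component along d
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # skip degenerate draws
            basis.append([a / norm for a in v])
    return basis


def latent_code(weights, directions):
    """Compute z = sum_m w_m * d_m."""
    dim = len(directions[0])
    return [sum(w * d[i] for w, d in zip(weights, directions)) for i in range(dim)]


directions = orthonormal_directions(num_directions=4, dim=8)
weights = [0.5, -1.0, 0.0, 2.0]
z = latent_code(weights, directions)

# "Control": change only the coefficient of direction 2 and recompute z.
edited_weights = list(weights)
edited_weights[2] += 1.5
z_edited = latent_code(edited_weights, directions)
delta = [a - b for a, b in zip(z_edited, z)]
```

Because the directions are orthonormal, each coefficient can be read back from $z$ by an inner product with its direction, and the edit moves $z$ only along the edited direction. This is the property that makes coefficient editing a plausible handle for control.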

Abstract

In the realm of stochastic human motion prediction (SHMP), researchers have often turned to generative models like GANs, VAEs, and diffusion models. However, most previous approaches have struggled to accurately predict motions that are both realistic and coherent with past motion due to a lack of guidance on the latent distribution. In this paper, we introduce Semantic Latent Directions (SLD) as a solution to this challenge, aiming to constrain the latent space to learn meaningful motion semantics and enhance the accuracy of SHMP. SLD defines a series of orthogonal latent directions and represents the hypothesis of future motion as a linear combination of these directions. By creating such an information bottleneck, SLD excels in capturing meaningful motion semantics, thereby improving the precision of motion predictions. Moreover, SLD offers controllable prediction capabilities by adjusting the coefficients of the latent directions during the inference phase. Expanding on SLD, we introduce a set of motion queries to enhance the diversity of predictions. Aligning these motion queries with the SLD space further improves the accuracy and coherence of the predicted motions. Through extensive experiments conducted on widely used benchmarks, we showcase the superiority of our method in accurately predicting motions while maintaining a balance of realism and diversity. Our code and pretrained models are available at https://github.com/GuoweiXu368/SLD-HMP.
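How motion queries might be aligned with the SLD space can be sketched as follows. This is a hypothetical simplification, not the released implementation: a single random linear map stands in for the projection from (past-motion feature, query) to coefficients, the first few standard basis vectors stand in for the learned orthogonal directions, and all dimensions and the number of queries are made up.

```python
import random

rng = random.Random(0)

# Hypothetical sizes chosen only for illustration.
FEATURE_DIM, QUERY_DIM = 6, 4
NUM_DIRECTIONS, LATENT_DIM = 3, 5
NUM_QUERIES = 10

# Stand-ins for learned quantities (randomly initialised here).
past_feature = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
queries = [[rng.gauss(0.0, 1.0) for _ in range(QUERY_DIM)]
           for _ in range(NUM_QUERIES)]
projection = [[rng.gauss(0.0, 0.1) for _ in range(FEATURE_DIM + QUERY_DIM)]
              for _ in range(NUM_DIRECTIONS)]

# Trivially orthonormal directions: the first M standard basis vectors.
directions = [[1.0 if i == m else 0.0 for i in range(LATENT_DIM)]
              for m in range(NUM_DIRECTIONS)]


def linear(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def query_to_latent(feature, query):
    """Map one (feature, query) pair to coefficients w and latent code z."""
    coeffs = linear(projection, feature + query)  # concatenate, then project
    z = [sum(w * d[i] for w, d in zip(coeffs, directions))
         for i in range(LATENT_DIM)]
    return coeffs, z


# One hypothesis per query, all conditioned on the same past motion.
hypotheses = [query_to_latent(past_feature, q) for q in queries]
```

Each query yields a different coefficient vector, so the samples differ, yet every latent code stays inside the span of the directions. In this sketch that is visible as the last two entries of every `z` being zero.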

Paper Structure (25 sections, 4 equations, 6 figures, 2 tables)

This paper contains 25 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Traditional methods for stochastic human motion prediction typically involve learning a generative latent distribution without appropriate constraints and guidance. This often results in challenges in acquiring meaningful human motion representations, leading to inaccurate predictions characterized by abnormal poses and incoherent sequences compared to past motion patterns. In contrast, our proposed Semantic Latent Directions (SLD) framework leverages semantic latent directions to steer motion prediction, enabling the generation of future motions with high precision, realism, and coherence with past motion sequences. Moreover, SLD facilitates semantically controllable human motion prediction by adjusting the weights of the semantic latent directions, as illustrated in the bottom part.
  • Figure 2: Overview of the framework: The past human motion is transformed to the frequency domain via the DCT. The encoded feature of the past motion and the motion query are merged and mapped into a series of latent coefficients $w = [w_1, \ldots, w_M]$ through the Query-to-Latent Projection (QLP) module. Semantic codes are derived by combining the semantic latent directions with the predicted coefficients. Subsequently, the features of the past human motion and the semantic codes are combined to predict future motion.
  • Figure 3: Illustration of different methods for promoting diverse motion sampling. (a) STARS directly combines the motion queries with past motions. (b) We project the motion queries into latent coefficients of the semantic latent directions, ensuring accuracy during diverse sampling.
  • Figure 4: Qualitative comparison on the Human3.6M and HumanEva-I datasets. Accurate predictions are marked with solid boxes, while inaccurate and abnormal predictions are highlighted with dashed boxes and arrows. Our approach consistently demonstrates accurate, coherent, and diverse predictions.
  • Figure 5: Visualization of controllable motion prediction on Human3.6M. Semantic control can be achieved by adjusting the coefficients in specific directions. Different degrees of semantic alteration can be attained by varying the magnitude of the coefficient change.
  • ...and 1 more figure
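
The DCT step mentioned in the Figure 2 caption can be sketched on a single joint coordinate. This is a minimal illustration under assumptions: an orthonormal DCT-II along the time axis is used, and the 20-frame sine trajectory and the choice of keeping 10 coefficients are hypothetical, not the paper's settings.

```python
import math


def dct_matrix(n_frames):
    """Orthonormal DCT-II basis; row k is the k-th temporal frequency."""
    rows = []
    for k in range(n_frames):
        scale = math.sqrt(1.0 / n_frames) if k == 0 else math.sqrt(2.0 / n_frames)
        rows.append([
            scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n_frames))
            for t in range(n_frames)
        ])
    return rows


def dct(trajectory, basis):
    """Project a time series onto the DCT basis (time -> frequency)."""
    return [sum(b * x for b, x in zip(row, trajectory)) for row in basis]


def idct(coeffs, basis, keep=None):
    """Reconstruct the time series from the first `keep` coefficients."""
    n_frames = len(basis)
    keep = n_frames if keep is None else keep
    return [sum(coeffs[k] * basis[k][t] for k in range(keep))
            for t in range(n_frames)]


# Hypothetical trajectory: one coordinate of one joint over 20 frames.
trajectory = [math.sin(0.2 * t) for t in range(20)]
basis = dct_matrix(len(trajectory))

coeffs = dct(trajectory, basis)           # frequency-domain representation
recon = idct(coeffs, basis)               # exact inverse with all coefficients
smooth = idct(coeffs, basis, keep=10)     # low-frequency approximation
```

Since the basis is orthonormal, the inverse transform is just the transpose, so `recon` matches the input up to floating-point error. Truncating to the low-frequency coefficients gives a smoothed trajectory.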