MotionMap: Representing Multimodality in Human Pose Forecasting

Reyhaneh Hosseininejad, Megh Shukla, Saeed Saadatnejad, Mathieu Salzmann, Alexandre Alahi

TL;DR

This work tackles the inherent multimodality of human pose forecasting by reframing it as a well-posed problem: for each observed pose sequence, only a finite set of future motions present in the training data is considered. It introduces MotionMap, a heatmap-based representation that encodes a variable number of future modes as local maxima in a 2D space, with a codebook that maps heatmap locations to latent futures and enables efficient mode coverage. The approach combines a two-stage pipeline (an autoencoder for transitions and MotionMap for multimodal forecasting) with an uncertainty decomposition that separates mode-level confidence from mode-conditioned prediction uncertainty. Empirically, MotionMap achieves strong multimodal recall and ranking on Human3.6M and AMASS, while offering controllability via action labels and improved sample efficiency compared to prior methods, highlighting practical benefits for safe and user-guided pose forecasting.

Abstract

Human pose forecasting is inherently multimodal since multiple futures exist for an observed pose sequence. However, evaluating multimodality is challenging since the task is ill-posed. Therefore, we first propose an alternative paradigm to make the task well-posed. Next, while state-of-the-art methods predict multimodality, this requires oversampling a large volume of predictions. This raises key questions: (1) Can we capture multimodality by efficiently sampling a smaller number of predictions? (2) Subsequently, which of the predicted futures is more likely for an observed pose sequence? We address these questions with MotionMap, a simple yet effective heatmap-based representation for multimodality. We extend heatmaps to represent a spatial distribution over the space of all possible motions, where different local maxima correspond to different forecasts for a given observation. MotionMap can capture a variable number of modes per observation and provide confidence measures for different modes. Further, MotionMap allows us to introduce the notion of uncertainty and controllability over the forecasted pose sequence. Finally, MotionMap captures rare modes that are non-trivial to evaluate yet critical for safety. We support our claims through multiple qualitative and quantitative experiments using popular 3D human pose datasets: Human3.6M and AMASS, highlighting the strengths and limitations of our proposed method. Project Page: https://vita-epfl.github.io/MotionMap
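
To make the idea of modes as local maxima concrete, the following is a minimal sketch of how peaks and their confidences could be read off a 2D heatmap and ranked. It is an illustration under stated assumptions, not the authors' implementation: the function names `find_modes` and `gaussian_bump`, the 8-neighbour peak test, the `threshold` value, and the synthetic heatmap are all hypothetical choices made for this example.

```python
import numpy as np


def find_modes(heatmap, threshold=0.1):
    """Return (row, col, confidence) for each local maximum, highest confidence first.

    Assumption: a mode is a cell strictly greater than its 8 neighbours and
    at least `threshold` in value. The paper may use a different criterion.
    """
    h, w = heatmap.shape
    # Pad with -inf so border cells can still qualify as peaks.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    neighbours = [
        padded[1 + dr : 1 + dr + h, 1 + dc : 1 + dc + w]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]
    neighbour_max = np.max(np.stack(neighbours), axis=0)
    is_peak = (heatmap > neighbour_max) & (heatmap >= threshold)
    rows, cols = np.nonzero(is_peak)
    conf = heatmap[rows, cols]
    order = np.argsort(-conf)  # rank modes by descending confidence
    return [(int(rows[i]), int(cols[i]), float(conf[i])) for i in order]


def gaussian_bump(shape, center, sigma, height):
    """Synthetic 2D Gaussian used only to build a toy heatmap."""
    rr, cc = np.mgrid[0 : shape[0], 0 : shape[1]]
    d2 = (rr - center[0]) ** 2 + (cc - center[1]) ** 2
    return height * np.exp(-d2 / (2.0 * sigma**2))


# Toy heatmap with one dominant mode and one rarer, lower-confidence mode.
heatmap = gaussian_bump((32, 32), (8, 8), 2.0, 1.0) + gaussian_bump(
    (32, 32), (24, 20), 2.0, 0.4
)
modes = find_modes(heatmap)
for rank, (r, c, conf) in enumerate(modes, start=1):
    print(f"rank {rank}: location=({r}, {c}) confidence={conf:.2f}")
```

In this toy setting the number of returned modes varies with the heatmap rather than being fixed in advance, and the rarer mode is kept as a separate, lower-ranked peak instead of being averaged away.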

Paper Structure

This paper contains 25 sections, 1 equation, 14 figures, 3 tables, 1 algorithm.

Figures (14)

  • Figure 1: MotionMap uses heatmaps to depict a spatial distribution over the space of motions. Local maxima imply that the corresponding motions have a higher likelihood of being a future motion for an observed pose sequence. MotionMap not only predicts a variable number of modes with the corresponding confidence, but it explicitly encodes rare modes that could otherwise be averaged out.
  • Figure 2: We define a two-stage pipeline for human pose forecasting. First, we train a framework similar to an autoencoder to predict the ground truth and future motion (Sec. 4.3). However, at test time we do not have the future motion and its latent as input. Therefore, we train a heatmap model to predict MotionMap, which together with the codebook encodes the likely motions and their latents as a drop-in replacement (Sec. 4.4). During fine-tuning and at inference time, we use the predicted MotionMap to obtain latents corresponding to high-confidence motions and use them in tandem with the observed pose sequence to predict the future pose sequence (Sec. 4.5).
  • Figure 3: The current approach to finding multimodal ground truths uses only the last frame to measure the similarity between sequences. However, this not only discards motion information, but also means that persons of different sizes performing the same motion may not be matched as multimodal ground truths. Hence, we propose computing the ground truths by using the last three frames and scaling the skeleton while retaining the motion. We do this using Cartesian-to-spherical coordinate transformations.
  • Figure 4: Controllability. MotionMap can also be used with auxiliary data such as action labels for controllable pose forecasting. Since each pose sequence is associated with an embedding and action label, a spatial distribution over the space of motions is the same as that over the action labels. This allows for the use of MotionMap to select modes based on the confidence as well as user preference for the forecasted action. We illustrate this distribution over the space of motions $\leftrightarrow$ actions for an example input from the Human3.6M dataset.
  • Figure 5: Ranking. Since MotionMap can predict a variable number of modes with their associated confidences, our method also allows us to rank predictions. For instance, the highest ranked prediction (top row among $\hat{Y}$) closely matches the ground truth motion. However, rare modes (bottom row) are ranked low since the corresponding mode has lower confidence.
  • ...and 9 more figures
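
The skeleton scaling mentioned in the Figure 3 caption can be sketched as follows. This is a hedged illustration of the general idea only: each bone vector is converted to spherical coordinates, its length is replaced by a reference length, and the angles (the pose) are kept. The function names, the parent-index skeleton layout, and the `target_lengths` argument are assumptions for this example; the caption does not specify the authors' exact procedure.

```python
import numpy as np


def cartesian_to_spherical(v):
    """Convert a 3D vector to (r, theta, phi): radius, polar angle, azimuth."""
    x, y, z = v[..., 0], v[..., 1], v[..., 2]
    r = np.linalg.norm(v, axis=-1)
    safe_r = np.where(r > 0, r, 1.0)  # avoid division by zero for degenerate bones
    theta = np.arccos(np.clip(z / safe_r, -1.0, 1.0))
    phi = np.arctan2(y, x)
    return r, theta, phi


def spherical_to_cartesian(r, theta, phi):
    """Inverse of cartesian_to_spherical."""
    return np.stack(
        [
            r * np.sin(theta) * np.cos(phi),
            r * np.sin(theta) * np.sin(phi),
            r * np.cos(theta),
        ],
        axis=-1,
    )


def rescale_skeleton(joints, parents, target_lengths):
    """Rebuild a pose with reference bone lengths while keeping bone directions.

    joints: (J, 3) joint positions.
    parents: parent index per joint, -1 for the root; parents must precede children.
    target_lengths: (J,) reference bone length per joint (root entry is ignored).
    """
    joints = np.asarray(joints, dtype=float)
    out = np.empty_like(joints)
    for j, p in enumerate(parents):
        if p < 0:
            out[j] = joints[j]  # keep the root where it is
            continue
        _, theta, phi = cartesian_to_spherical(joints[j] - joints[p])
        out[j] = out[p] + spherical_to_cartesian(target_lengths[j], theta, phi)
    return out


# Toy 3-joint chain; a person twice the size strikes the same pose.
parents = [-1, 0, 1]
small = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 2.0, 2.0]])
large = 2.0 * small
reference = np.array([0.0, 1.0, 1.0])
small_scaled = rescale_skeleton(small, parents, reference)
large_scaled = rescale_skeleton(large, parents, reference)
print(np.allclose(small_scaled, large_scaled))
```

After rescaling, the two differently sized skeletons coincide, so a similarity measure computed on the rescaled poses would compare pose and motion rather than body size.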