Scriboora: Rethinking Human Pose Forecasting

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

TL;DR

The paper tackles robust absolute human pose forecasting by exposing reproducibility issues, introducing a unified training and evaluation pipeline, and showing that speech-to-text architectures adapted across domains (notably MotionConformer) yield state-of-the-art, real-time forecasts. It introduces FADE and FCE as deployment-focused metrics and demonstrates substantial resilience gains through unsupervised finetuning on noisy inputs generated by pose estimators. A major finding is that realistic noise can dramatically degrade performance unless models are pretrained on diverse data and subsequently finetuned on noisy labels. Collectively, the work provides practical guidance for reproducible, deployment-ready pose forecasting and releases code and datasets to support future research.

Abstract

Human pose forecasting predicts future poses based on past observations, and has many significant applications in areas such as action recognition, autonomous driving, and human-robot interaction. This paper evaluates a wide range of pose forecasting algorithms on the task of absolute pose forecasting, revealing many reproducibility issues, and provides a unified training and evaluation pipeline. After drawing a high-level analogy to the task of speech understanding, it is shown that recent speech models can be efficiently adapted to the task of pose forecasting and improve on current state-of-the-art performance. Finally, the robustness of the models is evaluated using noisy joint coordinates obtained from a pose estimator model, which reflects a realistic type of noise that is closer to real-world applications. For this, a new dataset variation is introduced. It is shown that estimated poses result in a substantial performance degradation, and how much of it can be recovered by unsupervised finetuning.

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Example of an absolute pose forecast of a walking person. The green skeletons visualize the input sequence of the prediction model, the red ones the predicted future poses, and the blue ones the ground-truth labels.
  • Figure 2: Example of a relative pose forecast of a walking person. The green skeletons visualize the input sequence of the prediction model, and the red ones the predicted future poses. Only the predicted joints are visualized, therefore this person does not have a hip (it is fixed to the same spot).
  • Figure 3: Example of improvements through finetuning on the Human3.6M dataset. The ground truth is shown in blue, the prediction in orange. The walking movement was already continued well in (a), but the stride distances in particular improved after finetuning. For better visualization, some intermediate timesteps are not displayed.
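
The difference between the absolute forecast in Figure 1 and the relative forecast in Figure 2 can be illustrated with a minimal sketch. It is not the paper's code: the function name `to_relative`, the joint layout, and the choice of index 0 as the hip joint are assumptions made only for this example. Subtracting the hip position from every joint in each frame pins the hip to the origin, which is why the person in Figure 2 appears to stay on the same spot.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) coordinates of one joint
Pose = List[Joint]                   # all joints of one frame

HIP_INDEX = 0  # assumption: the hip (root) joint is stored first


def to_relative(sequence: List[Pose], hip_index: int = HIP_INDEX) -> List[Pose]:
    """Convert absolute poses to hip-relative poses, frame by frame."""
    relative = []
    for pose in sequence:
        hx, hy, hz = pose[hip_index]
        # Shift every joint so that the hip lands on the origin.
        relative.append([(x - hx, y - hy, z - hz) for (x, y, z) in pose])
    return relative


# Hypothetical two-joint skeleton (hip, head) moving 0.5 units along x.
walk = [
    [(0.0, 0.0, 1.0), (0.0, 0.0, 1.5)],
    [(0.5, 0.0, 1.0), (0.5, 0.0, 1.5)],
]
rel = to_relative(walk)
```

In `walk` the hip moves between frames, whereas in `rel` it stays at the origin in both frames, so the global displacement is no longer part of the sequence. An absolute forecasting model has to predict that displacement as well.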