FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations

Christian Diller, Thomas Funkhouser, Angela Dai

TL;DR

This work tackles long-horizon 3D human behavior forecasting from 2D video by jointly predicting sequences of action labels and their characteristic 3D poses in an autoregressive framework. It relies on weak supervision from 2D action data: a differentiable 2D projection aligns 3D predictions with 2D observations, and an adversarial 3D pose loss over an unpaired 3D pose database enforces realism. The key contribution is the integration of action and pose forecasting into a single model, which improves accuracy for both actions and poses and stabilizes long-term sequences compared to treating the tasks separately. Because it needs no paired 3D ground truth, the approach scales to settings where 3D data is scarce, with potential applications in robotics, surveillance, and content creation. The results indicate that grounding 3D pose generation in semantic actions gives more reliable long-term predictions under weak supervision.

Abstract

We present a generative approach to forecast long-term future human behavior in 3D, requiring only weak supervision from readily available 2D human action data. This is a fundamental task enabling many downstream applications. The required ground-truth data is hard to capture in 3D (mocap suits, expensive setups) but easy to acquire in 2D (simple RGB cameras). Thus, we design our method to only require 2D RGB data at inference time while being able to generate 3D human motion sequences. We use a differentiable 2D projection scheme in an autoregressive manner for weak supervision, and an adversarial loss for 3D regularization. Our method predicts long and complex human behavior sequences (e.g., cooking, assembly) consisting of multiple sub-actions. We tackle this in a semantically hierarchical manner, jointly predicting high-level coarse action labels together with their low-level fine-grained realizations as characteristic 3D human poses. We observe that these two action representations are coupled in nature, and joint prediction benefits both action and pose forecasting. Our experiments demonstrate the complementary nature of joint action and 3D pose prediction: our joint approach outperforms each task treated individually, enables robust longer-term sequence prediction, and improves over alternative approaches to forecast actions and characteristic 3D poses.
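
The weak-supervision idea in the abstract (compare a predicted 3D pose with a 2D observation through a projection) can be illustrated with a minimal sketch. It assumes a simple pinhole camera and camera-space joints; the function names, the intrinsics, and the squared-pixel loss are illustrative assumptions, not the paper's actual formulation. In practice the same arithmetic would be written in an autodiff framework so that gradients flow back to the 3D pose decoder.

```python
def project_points(points_3d, focal, center):
    """Pinhole projection of camera-space 3D joints (x, y, z) to 2D pixels.

    Assumed camera model for illustration only; the paper's exact
    projection scheme is not specified in this summary.
    """
    fx, fy = focal
    cx, cy = center
    projected = []
    for x, y, z in points_3d:
        if z <= 0:
            raise ValueError("joints must lie in front of the camera (z > 0)")
        projected.append((fx * x / z + cx, fy * y / z + cy))
    return projected


def reprojection_loss(pred_3d, target_2d, focal, center):
    """Mean squared pixel distance between projected 3D joints and 2D targets."""
    pred_2d = project_points(pred_3d, focal, center)
    total = 0.0
    for (u, v), (tu, tv) in zip(pred_2d, target_2d):
        total += (u - tu) ** 2 + (v - tv) ** 2
    return total / len(target_2d)


if __name__ == "__main__":
    focal, center = (1000.0, 1000.0), (320.0, 240.0)  # hypothetical intrinsics
    pose = [(0.0, 0.0, 2.0), (0.5, -0.5, 2.0)]         # two toy joints
    target = project_points(pose, focal, center)

    # A pose scaled along the camera rays projects to the same 2D points,
    # so the 2D loss alone cannot distinguish the two.
    scaled = [(2 * x, 2 * y, 2 * z) for x, y, z in pose]
    print(reprojection_loss(pose, target, focal, center))
    print(reprojection_loss(scaled, target, focal, center))
```

Both calls report zero loss, which shows the depth ambiguity of supervising through a 2D projection only. This is one reason a separate 3D regularizer, such as the adversarial loss over a 3D pose database described in the abstract, is useful.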

Paper Structure (38 sections, 3 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: We propose a novel generative approach to model long-term future human behavior by jointly forecasting a sequence of coarse action labels and their concrete realizations as 3D body poses. For broad applicability, our autoregressive method only requires weak supervision and past observations in the form of 2D RGB video data, together with a database of uncorrelated 3D human poses.
  • Figure 2: Our approach takes as input a sequence of RGB images, from which 2D poses are extracted, as well as their corresponding action label and initial set of objects. Each input is encoded into a joint latent space to jointly predict the next action label and characteristic 3D pose. While action labels are directly supervised, the 3D pose decoder is trained to match the next 2D action pose using differentiable projection, and an adversarial 3D loss encourages valid 3D pose prediction.
  • Figure 3: Action accuracy over time. Our joint action-characteristic pose forecasting enables more robust autoregressive action forecasting than action prediction without considering pose.
  • Figure 4: Qualitative comparison between DLow [yuan2020dlow], GSPS [mao2021gsps], STARS [DBLP:conf/eccv/XuWG22], and our method on IKEA-ASM [ben2021ikea] data. For each method, we show the 3D predicted pose projected into the 2D target view, without background (small) and with background for context (full size). Our joint reasoning captures the individual characteristic action poses more faithfully while producing spatially plausible 3D poses.
  • Figure 5: Qualitative comparison between DLow [yuan2020dlow], GSPS [mao2021gsps], STARS [DBLP:conf/eccv/XuWG22], and our method on two sequences (left and right) from MPII Cooking II [rohrbach15ijcv]. For each method, we show the 3D predicted pose projected into 2D, without background (small) and with background for context (full size). By considering both 3D pose and action forecasting together, we more effectively forecast the longer-term behavior.
  • ...and 2 more figures
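
The autoregressive forecasting referred to in the captions of Figures 1 to 3 can be sketched as a rollout loop in which each predicted (action label, characteristic 3D pose) pair is appended to the history and fed back as input for the next step. This is a minimal sketch under stated assumptions: `rollout`, `toy_predict_next`, and the action names are hypothetical placeholders, and the toy predictor stands in for the learned model, whose real inputs (2D poses, action labels, objects) are only summarized in Figure 2.

```python
from typing import Callable, List, Tuple

Pose3D = List[Tuple[float, float, float]]
Step = Tuple[str, Pose3D]  # (action label, characteristic 3D pose)


def rollout(predict_next: Callable[[List[Step]], Step],
            observed: List[Step],
            num_steps: int) -> List[Step]:
    """Autoregressive rollout: every prediction becomes part of the next input."""
    history = list(observed)  # copy so the caller's list is not modified
    forecast: List[Step] = []
    for _ in range(num_steps):
        nxt = predict_next(history)
        forecast.append(nxt)
        history.append(nxt)
    return forecast


# Hypothetical action vocabulary, for illustration only.
ACTIONS = ["pick up leg", "align leg", "spin leg"]


def toy_predict_next(history: List[Step]) -> Step:
    """Stand-in for a learned model: cycles through ACTIONS, emits a dummy pose."""
    last_action = history[-1][0]
    next_action = ACTIONS[(ACTIONS.index(last_action) + 1) % len(ACTIONS)]
    pose: Pose3D = [(0.0, 0.0, float(len(history)))]  # single dummy joint
    return next_action, pose


if __name__ == "__main__":
    observed = [("pick up leg", [(0.0, 0.0, 1.0)])]
    for action, pose in rollout(toy_predict_next, observed, 3):
        print(action, pose)
```

Because each step consumes earlier predictions, errors can compound over long horizons. The Figure 3 caption reports that forecasting actions jointly with characteristic poses makes this autoregressive process more robust than predicting actions alone.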