CHAMP: Conformalized 3D Human Multi-Hypothesis Pose Estimators

Harry Zhang, Luca Carlone

TL;DR

This work introduces CHAMP, a novel method for learning sequence-to-sequence, multi-hypothesis 3D human poses from 2D keypoints by leveraging a conditional distribution with a diffusion model. The results indicate that simple mean aggregation over the hypothesis set filtered by conformal prediction yields competitive results.

Abstract

We introduce CHAMP, a novel method for learning sequence-to-sequence, multi-hypothesis 3D human poses from 2D keypoints by leveraging a conditional distribution with a diffusion model. To predict a single output 3D pose sequence, we generate and aggregate multiple 3D pose hypotheses. For better aggregation results, we develop a method to score these hypotheses during training, effectively integrating conformal prediction into the learning process. This process results in a differentiable conformal predictor that is trained end-to-end with the 3D pose estimator. After training, the learned scoring model is used as the conformity score, and the 3D pose estimator is combined with a conformal predictor to select the most accurate hypotheses for downstream aggregation. Our results indicate that simple mean aggregation over the hypothesis set filtered by conformal prediction yields competitive results. When integrated with more sophisticated aggregation techniques, our method achieves state-of-the-art performance across various metrics and datasets while inheriting the probabilistic guarantees of conformal prediction.
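The filter-then-aggregate step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a standard split-conformal calibration in which a held-out set provides conformity scores for ground-truth poses, and it assumes higher scores mean better conformity. The function names `conformal_threshold` and `filter_and_aggregate`, the array shapes, and the fallback to the best-scored hypothesis are all hypothetical choices made for illustration.

```python
import numpy as np


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold on conformity scores (higher = better).

    Assumption: cal_scores are scores of ground-truth poses on a held-out
    calibration set. Keeping hypotheses with score >= threshold targets
    coverage of at least 1 - alpha under exchangeability.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    n = cal_scores.shape[0]
    # Rank of the finite-sample-corrected quantile of the nonconformity scores.
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: keep every hypothesis.
        return -np.inf
    nonconformity = np.sort(-cal_scores)
    return -nonconformity[k - 1]


def filter_and_aggregate(hypotheses, scores, tau):
    """Keep hypotheses with score >= tau and average them.

    hypotheses: array of shape (H, T, J, 3), H hypotheses of a T-frame,
                J-joint 3D pose sequence (shape is an assumption).
    scores:     array of shape (H,), one conformity score per hypothesis.
    """
    hypotheses = np.asarray(hypotheses, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = scores >= tau
    if not keep.any():
        # Hypothetical fallback: use the best-scored hypothesis alone.
        keep = scores == scores.max()
    return hypotheses[keep].mean(axis=0), keep


# Toy usage with synthetic numbers (not data from the paper).
cal_scores = np.arange(1, 11) / 10.0
tau = conformal_threshold(cal_scores, alpha=0.2)
hyps = np.stack([np.full((2, 3, 3), float(i)) for i in range(4)])
hyp_scores = np.array([0.1, 0.5, 0.9, 0.15])
pose, keep = filter_and_aggregate(hyps, hyp_scores, tau)
```

In this toy run, only the hypotheses whose score clears the calibrated threshold contribute to the averaged pose sequence. A more sophisticated aggregator could replace the plain mean without changing the filtering step.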

Paper Structure (33 sections, 21 equations, 16 figures, 6 tables)

Figures (16)

  • Figure 1: CHAMP sample results obtained on in-the-wild videos collected from TikTok. Having observed 2D keypoints, CHAMP proposes multiple hypotheses of the 3D human skeleton poses, and then a conformal predictor trained end-to-end with the pose estimator refines the confidence set by filtering out low-conformity-score hypotheses. This smaller set will be used in downstream aggregation for a single output prediction.
  • Figure 2: CHAMP Overview. CHAMP takes as input a sequence of 2D keypoints detected on a series of input RGB video frames. The 2D keypoint sequence is fed into a Diffusion Model to produce 3D keypoint hypotheses for the sequence. The output of the Diffusion Model is supervised via a Pose Loss. We then apply differentiable CP end-to-end during training on the hypothesis sequences, resulting in a smaller confidence set. The confidence set is used to calculate an Inefficiency Loss during training. Note that, for better interpretability, the figure shows a single frame of the sequence and a hard assignment for the confidence set during training.
  • Figure 3: Your Figure Caption Here
  • Figure 4: Comparison of conformity scores. Top: filtered hypotheses of two joints using the three scoring functions. Bottom: MPJPE (mm) values.
  • Figure 5: Comparison of the number of hypotheses in training and inference with four variants of CHAMP. Red: Naive, Yellow: CHAMP, Brown: Agg, Green: Best.
  • ...and 11 more figures
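
The Figure 2 caption mentions a confidence set computed differentiably during training and an Inefficiency Loss derived from it. The sketch below shows one common way such a loss is built in conformal-training setups: a sigmoid replaces the hard "score >= threshold" test, and the loss penalizes the soft set size. This is an assumption about the general technique, not the paper's formulation; `soft_membership`, `inefficiency_loss`, `temperature`, and `target_size` are hypothetical names and parameters. NumPy is used only to show the forward computation; an actual training loop would write the same expressions in an automatic-differentiation framework.

```python
import numpy as np


def soft_membership(scores, tau, temperature=0.1):
    """Smooth stand-in for the hard test (score >= tau).

    Returns values in (0, 1); a smaller temperature approaches hard assignment.
    """
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-(scores - tau) / temperature))


def inefficiency_loss(scores, tau, temperature=0.1, target_size=1.0):
    """Penalize the soft confidence-set size above a target size.

    Assumption: the soft set size is the sum of soft memberships, and sizes
    at or below target_size incur no penalty.
    """
    soft_size = soft_membership(scores, tau, temperature).sum()
    return max(float(soft_size) - target_size, 0.0)


# Toy usage with synthetic scores (not data from the paper).
scores = np.array([0.9, 0.8, 0.1, 0.05])
membership = soft_membership(scores, tau=0.5)
loss = inefficiency_loss(scores, tau=0.5)
```

Lowering the scores of hypotheses that sit near the threshold shrinks the soft set size, which is what lets a size penalty of this kind push the scoring model toward smaller confidence sets.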