CHAMP: Conformalized 3D Human Multi-Hypothesis Pose Estimators

Harry Zhang; Luca Carlone

TL;DR

This work introduces CHAMP, a novel method for learning sequence-to-sequence, multi-hypothesis 3D human poses from 2D keypoints by learning a conditional distribution with a diffusion model. Results indicate that a simple mean aggregation over the conformal prediction-filtered hypothesis set yields competitive results.

Abstract

We introduce CHAMP, a novel method for learning sequence-to-sequence, multi-hypothesis 3D human poses from 2D keypoints by leveraging a conditional distribution with a diffusion model. To predict a single output 3D pose sequence, we generate and aggregate multiple 3D pose hypotheses. For better aggregation results, we develop a method to score these hypotheses during training, effectively integrating conformal prediction into the learning process. This process results in a differentiable conformal predictor that is trained end-to-end with the 3D pose estimator. After training, the learned scoring model is used as the conformity score, and the 3D pose estimator is combined with a conformal predictor to select the most accurate hypotheses for downstream aggregation. Our results indicate that a simple mean aggregation over the conformal prediction-filtered hypothesis set yields competitive results. When integrated with more sophisticated aggregation techniques, our method achieves state-of-the-art performance across various metrics and datasets while inheriting the probabilistic guarantees of conformal prediction.
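
The filter-then-average step described in the abstract can be illustrated with a minimal split conformal prediction sketch. This is not the authors' implementation: the function names, the array shapes, the use of ground-truth scores for calibration, and the fallback to the top-scoring hypothesis when the set is empty are all assumptions made for illustration, and the conformity scores are passed in as plain numbers rather than produced by a learned scoring model.

```python
import numpy as np


def conformal_threshold(cal_scores, alpha):
    """Score threshold tau from calibration conformity scores (higher = better).

    Assumption: cal_scores holds the score given to the ground-truth pose of
    each calibration example. Under exchangeability, a new ground-truth score
    falls below tau with probability at most alpha.
    """
    cal_scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = cal_scores.size
    k = int(np.floor(alpha * (n + 1)))  # finite-sample corrected rank
    if k < 1:
        return -np.inf  # alpha * (n + 1) < 1: no hypothesis can be rejected
    return cal_scores[k - 1]  # k-th smallest calibration score


def filter_and_aggregate(hypotheses, scores, tau):
    """Keep hypotheses whose score is at least tau, then average them.

    hypotheses: array of shape (H, J, 3), H hypotheses of J 3D joints.
    scores:     array of shape (H,), one conformity score per hypothesis.
    """
    hypotheses = np.asarray(hypotheses, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = scores >= tau
    if not keep.any():
        # Hypothetical fallback: keep the single best-scoring hypothesis.
        keep[np.argmax(scores)] = True
    return hypotheses[keep].mean(axis=0), keep


# Toy usage with made-up numbers: 19 calibration scores, 4 hypotheses.
cal = np.arange(1, 20) / 20.0
tau = conformal_threshold(cal, alpha=0.25)
hyps = np.stack([np.full((2, 3), v) for v in (0.0, 1.0, 2.0, 10.0)])
pose, keep = filter_and_aggregate(hyps, [0.9, 0.5, 0.3, 0.1], tau)
```

In this toy run the low-scoring outlier hypothesis is dropped before averaging, which is the intuition behind filtering ahead of a simple mean. A more sophisticated aggregator could replace the `mean` call without changing the filtering step.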

Paper Structure (33 sections, 21 equations, 16 figures, 6 tables)

Introduction
Related Work
Problem Formulation
Methods
Learning 3D Human Poses with a Diffusion Model
Learning Conformalization for the Hypotheses Confidence Set
Conformalized Inference
Experiments
Results on Human3.6M
Results on MPI-INF-3DHP
In-the-Wild Videos
Ablation Studies with Human3.6M Dataset
Exchangeability and CP Guarantee
Implementation and Training Details
Limitations
...and 18 more sections

Figures (16)

Figure 1: CHAMP sample results obtained on in-the-wild videos collected from TikTok. Having observed 2D keypoints, CHAMP proposes multiple hypotheses of the 3D human skeleton poses, and then a conformal predictor trained end-to-end with the pose estimator refines the confidence set by filtering out low-conformity-score hypotheses. This smaller set will be used in downstream aggregation for a single output prediction.
Figure 2: CHAMP Overview. CHAMP takes as input a sequence of 2D keypoints detected on a series of input RGB video frames. The 2D keypoint sequence is fed into a Diffusion Model to produce 3D keypoint hypotheses for the sequence. The output of the Diffusion Model is supervised via a Pose Loss. We then apply differentiable CP end-to-end during training on the hypothesis sequences, resulting in a smaller confidence set. The confidence set is used to calculate an Inefficiency Loss during training. Note that, for better interpretability, the figure shows only one frame of the sequence and uses hard assignment for the confidence set during training.
Figure 3: Your Figure Caption Here
Figure 4: Comparison of conformity scores. Top: filtered hypotheses of two joints using the three scoring functions. Bottom: MPJPE (mm) values.
Figure 5: Comparison of the number of hypotheses in training and inference with 4 variants of CHAMP. Red: Naive, Yellow: CHAMP, Brown: Agg, Green: Best.
...and 11 more figures
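
The Figure 2 caption mentions an Inefficiency Loss computed on the confidence set and notes that hard assignment is shown only for interpretability. One common way to make set size differentiable, assumed here and not confirmed by the source, is to replace the hard test `score >= tau` with a sigmoid membership and sum the memberships. The sketch below shows only the forward computation in NumPy; the function names and the `temperature` parameter are hypothetical, and actual training would need an autodiff framework.

```python
import numpy as np


def soft_set_size(scores, tau, temperature=0.1):
    """Smooth surrogate for the number of hypotheses with score >= tau.

    Membership is sigmoid((score - tau) / temperature), written with tanh
    for numerical stability. Returns (soft size, per-hypothesis membership).
    """
    scores = np.asarray(scores, dtype=float)
    membership = 0.5 * (1.0 + np.tanh((scores - tau) / (2.0 * temperature)))
    return membership.sum(axis=-1), membership


def hard_set_size(scores, tau):
    """Non-differentiable count of hypotheses with score >= tau."""
    return int((np.asarray(scores, dtype=float) >= tau).sum(axis=-1))


# Toy usage with made-up scores: the soft size approaches the hard count
# as the temperature shrinks.
scores = [0.9, 0.5, 0.1]
soft_size, membership = soft_set_size(scores, tau=0.3, temperature=1e-3)
hard_size = hard_set_size(scores, tau=0.3)
```

Minimizing a quantity like `soft_size` would push the scoring model toward smaller confidence sets, which is the role the caption assigns to the Inefficiency Loss.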
