AugLift: Uncertainty Aware Depth Descriptors for Robust 2D to 3D Pose Lifting

Nikolai Warner, Wenjin Zhang, Hamid Badiozamani, Irfan Essa, Apaar Sadhwani

TL;DR

Monocular 3D human pose lifting generalizes poorly when 2D detections are noisy. The authors introduce AugLift, a lightweight, modular augmentation that attaches an Uncertainty-Aware Depth Descriptor (UADD) to each 2D keypoint. The descriptor is computed from a monocular depth map and the detector confidence, so it expands the lifter's input without altering its architecture. Across four datasets and four lifting backbones, AugLift reduces MPJPE by an average of ~10.1% out of distribution and ~4.0% in distribution, with the largest benefits on occluded joints and novel poses. AugLift is complementary to dense image features, and a learned-depth variant (AugLiftV2) offers additional gains at the cost of interpretability. Together, the results suggest that confidence-aware depth cues are an effective plug-in for robust 2D-to-3D pose lifting.

Abstract

Lifting-based 3D human pose estimators infer 3D joints from 2D keypoints, but often struggle to generalize to real-world settings with noisy 2D detections. We revisit the input to lifting and propose AugLift, a simple augmentation of standard lifting that enriches each 2D keypoint (x, y) with an Uncertainty-Aware Depth Descriptor (UADD). We run a single off-the-shelf monocular depth estimator to obtain a depth map, and for every keypoint with detector confidence c we extract depth statistics from its confidence-scaled neighborhood, forming a compact, interpretable UADD (c, d, d_min, d_max) that captures both local geometry and reliability. AugLift is modular, requires no new sensors or architectural changes, and integrates by expanding the input layer of existing lifting models. Across four datasets and four lifting architectures, AugLift boosts cross-dataset (out-of-distribution) performance on unseen data by an average of 10.1 percent, while also improving in-distribution performance by 4.0 percent as measured by MPJPE. A post hoc analysis clarifies when and why it helps: gains are largest on novel poses and significantly occluded joints, where depth statistics resolve front-back ambiguities while confidence calibrates the spatial neighborhoods from which they are drawn. We also study interaction with recent image-feature lifting methods and find the signals are complementary: adding UADD to image-conditioned lifting yields both ID and OOD gains. A learned depth feature extension (AugLiftV2) improves performance further while trading off interpretability. Together, these results indicate that lightweight, confidence-aware depth cues are a powerful plug-in for robust 2D-to-3D pose lifting.
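
The descriptor construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the square sampling window, the mapping `radius = base_radius / c` (capped at `max_radius`), and the choice of the center-pixel depth for `d` are all assumptions made for illustration. Only the output layout (x, y, c, d, d_min, d_max) and the idea that lower confidence widens the neighborhood come from the text.

```python
from typing import List, Sequence, Tuple

DepthMap = Sequence[Sequence[float]]          # depth[row][col]
Keypoint = Tuple[float, float, float]         # (x, y, confidence)


def uadd_for_keypoint(depth: DepthMap, x: float, y: float, c: float,
                      base_radius: float = 1.0, max_radius: int = 8,
                      eps: float = 1e-3) -> Tuple[float, float, float, float]:
    """Return a hypothetical UADD (c, d, d_min, d_max) for one keypoint."""
    h, w = len(depth), len(depth[0])
    # Assumption: the radius is inversely proportional to confidence, then capped.
    radius = min(max_radius, int(round(base_radius / max(c, eps))))
    # Clamp the keypoint to a valid pixel.
    cx = min(max(int(round(x)), 0), w - 1)
    cy = min(max(int(round(y)), 0), h - 1)
    # Square window around the keypoint, clipped to the image bounds.
    x0, x1 = max(cx - radius, 0), min(cx + radius, w - 1)
    y0, y1 = max(cy - radius, 0), min(cy + radius, h - 1)
    window = [depth[r][col]
              for r in range(y0, y1 + 1)
              for col in range(x0, x1 + 1)]
    d = depth[cy][cx]  # Assumption: d is the depth at the keypoint itself.
    return (c, d, min(window), max(window))


def augment_keypoints(depth: DepthMap,
                      keypoints: Sequence[Keypoint]) -> List[Tuple[float, ...]]:
    """Expand each (x, y, c) keypoint to (x, y, c, d, d_min, d_max)."""
    return [(x, y) + uadd_for_keypoint(depth, x, y, c) for x, y, c in keypoints]


if __name__ == "__main__":
    # Toy 5x5 depth map whose values increase along rows and columns.
    toy_depth = [[float(r * 5 + col) for col in range(5)] for r in range(5)]
    for row in augment_keypoints(toy_depth, [(2.0, 2.0, 1.0), (2.0, 2.0, 0.25)]):
        print(row)
```

Under these assumptions, a lifter consuming the output would take six values per joint instead of two, which matches the abstract's description of integrating by expanding the input layer.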

Paper Structure

This paper contains 22 sections, 3 equations, 8 figures, 14 tables, 1 algorithm.

Figures (8)

  • Figure 1: AugLift Enriches Standard Lifting Inputs for Better Generalization. Standard lifting models use only sparse 2D coordinates (x, y), which creates ambiguity. Conditioning on dense image or depth maps tends to overfit to domain-specific backgrounds and appearance. AugLift instead enriches, rather than replaces, the (x, y) input using two signals: a monocular depth map and keypoint confidence. Local depth statistics (d, d_min, d_max) combined with confidence (c) form a compact Uncertainty-Aware Depth Descriptor (UADD), which delivers high-quality geometric cues that generalize well to cases with occlusion or novel poses.
  • Figure 2: How AugLift Learns to Trust Keypoints via Adaptive Uncertainty Sampling. Left: We visualize the two signals that augment the standard lifting input: 2D keypoint confidence (top) and a monocular depth map (bottom). Confidence scores act as a proxy for visibility (blue = confident, red X = unconfident), while the depth map captures the 3D structure (blue = near, red = far). Right: AugLift implements confidence-aware sampling by using the confidence score to define a sampling radius that is inversely proportional to it. As shown, low-confidence (often occluded) joints use a wider sampling radius to gather robust depth statistics while high-confidence joints use a small, localized radius for a precise estimate. This allows the model to learn to identify and distrust unreliable depth estimates.
  • Figure 3: Qualitative results demonstrating AugLift's robustness to ambiguity. These visualizations compare the baseline (Vanilla Pred, red) with our AugLift-enabled model (Ours Pred, green) against the ground truth (black). The baseline, relying on sparse 2D keypoints alone, fails on challenging, out-of-distribution poses with significant occlusion (e.g., sitting, crouching). AugLift, through its Uncertainty-Aware Depth Descriptors (UADDs), adaptively weighs geometric cues and keypoint reliability to resolve front–back ambiguities and recover a more accurate 3D structure.
  • Figure 4: Sampling dense feature maps at 2D keypoints. We show the 2D detection bounding box from RTMPose (center) aligned with its feature map, alongside the monocular depth feature map (right). We visualize the $\ell_2$ norm of channel activations per coarse patch and sample features at 2D keypoints by averaging the four nearest spatial embeddings. This design preserves the lightweight nature of standard lifting and AugLift while enabling the use of richer spatial cues from image and depth features.
  • Figure 5: Longer sequences can harm cross-dataset generalization. While in-distribution error (H36M, top-left) decreases with sequence length, out-of-distribution error on other datasets stays flat or increases. This analysis used a VideoPose3D model in the GT setting.
  • ...and 3 more figures
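
The Figure 4 caption mentions sampling dense features at 2D keypoints by averaging the four nearest spatial embeddings. A minimal sketch of that idea follows. The function name, the `stride` parameter, the cell-center convention, and the border clamping are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence

FeatureMap = Sequence[Sequence[Sequence[float]]]  # feature_map[row][col][channel]


def sample_feature(feature_map: FeatureMap, x: float, y: float,
                   stride: float) -> List[float]:
    """Average the four grid cells nearest to image-space point (x, y)."""
    h, w = len(feature_map), len(feature_map[0])
    # Assumption: cell (i, j) is centered at ((j + 0.5) * stride, (i + 0.5) * stride).
    gx = x / stride - 0.5
    gy = y / stride - 0.5
    fx, fy = math.floor(gx), math.floor(gy)
    # Clamp both neighbors so border keypoints reuse the edge cell.
    x0, x1 = min(max(fx, 0), w - 1), min(max(fx + 1, 0), w - 1)
    y0, y1 = min(max(fy, 0), h - 1), min(max(fy + 1, 0), h - 1)
    cells = [feature_map[y0][x0], feature_map[y0][x1],
             feature_map[y1][x0], feature_map[y1][x1]]
    channels = len(cells[0])
    # Unweighted mean over the four neighbors, channel by channel.
    return [sum(cell[k] for cell in cells) / 4.0 for k in range(channels)]


if __name__ == "__main__":
    # Toy 2x2 feature map with one channel and a stride of 16 pixels.
    toy_map = [[[0.0], [2.0]], [[4.0], [6.0]]]
    print(sample_feature(toy_map, 16.0, 16.0, 16.0))
```

An unweighted mean is used here because the caption says "averaging"; a bilinear weighting would be a natural alternative, but the source does not say which one is used.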