AugLift: Uncertainty Aware Depth Descriptors for Robust 2D to 3D Pose Lifting
Nikolai Warner, Wenjin Zhang, Hamid Badiozamani, Irfan Essa, Apaar Sadhwani
TL;DR
Monocular 3D human pose lifting suffers from poor generalization when 2D detections are noisy. The authors introduce AugLift, a lightweight, modular augmentation that attaches an Uncertainty-Aware Depth Descriptor (UADD) to each 2D keypoint, computed from a monocular depth map and detector confidence, expanding inputs without altering lifter architectures. Across four datasets and four lifting backbones, AugLift yields substantial out-of-distribution gains (average ~10.1% MPJPE reduction) and notable in-distribution improvements (~4.0%), with the largest benefits on occluded joints and novel poses. AugLift is complementary to dense image features, and a learned-depth variant (AugLiftV2) offers additional gains at the cost of interpretability, suggesting depth cues as a robust plug-in for 2D-to-3D pose lifting.
Abstract
Lifting-based 3D human pose estimators infer 3D joints from 2D keypoints, but often struggle to generalize to real-world settings with noisy 2D detections. We revisit the input to lifting and propose AugLift, a simple augmentation of standard lifting that enriches each 2D keypoint (x, y) with an Uncertainty-Aware Depth Descriptor (UADD). We run a single off-the-shelf monocular depth estimator to obtain a depth map, and for every keypoint with detector confidence c we extract depth statistics from its confidence-scaled neighborhood, forming a compact, interpretable UADD (c, d, d_min, d_max) that captures both local geometry and reliability. AugLift is modular, requires no new sensors or architectural changes, and integrates by expanding the input layer of existing lifting models. Across four datasets and four lifting architectures, AugLift boosts cross-dataset (out-of-distribution, OOD) performance on unseen data by an average of 10.1 percent, while also improving in-distribution (ID) performance by 4.0 percent, as measured by MPJPE. A post hoc analysis clarifies when and why it helps: gains are largest on novel poses and significantly occluded joints, where depth statistics resolve front-back ambiguities while confidence calibrates the spatial neighborhoods from which they are drawn. We also study the interaction with recent image-feature lifting methods and find the signals are complementary: adding UADD to image-conditioned lifting yields both ID and OOD gains. A learned depth feature extension (AugLiftV2) improves performance further while trading off interpretability. Together, these results indicate that lightweight, confidence-aware depth cues are a powerful plug-in for robust 2D-to-3D pose lifting.
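To make the descriptor concrete, the following is a minimal NumPy sketch of how a per-keypoint UADD (c, d, d_min, d_max) could be extracted and appended to (x, y). It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the neighborhood is scaled by confidence, so the rule used here (a square window whose radius grows as confidence drops) and the names `uadd_features`, `base_radius`, and `max_extra_radius` are hypothetical.

```python
import numpy as np

def uadd_features(depth_map, keypoints, base_radius=2.0, max_extra_radius=8.0):
    """Build per-joint inputs (x, y, c, d, d_min, d_max).

    depth_map: (H, W) array from any monocular depth estimator.
    keypoints: (J, 3) array of (x, y, c), with confidence c in [0, 1].
    """
    h, w = depth_map.shape
    out = np.zeros((len(keypoints), 6), dtype=np.float64)
    for j, (x, y, c) in enumerate(keypoints):
        c = float(np.clip(c, 0.0, 1.0))
        # ASSUMPTION: lower confidence -> larger window. The paper only says
        # the neighborhood is "confidence-scaled"; this rule is illustrative.
        r = int(round(base_radius + (1.0 - c) * max_extra_radius))
        # Snap the keypoint to a valid pixel (column from x, row from y).
        col = int(np.clip(round(x), 0, w - 1))
        row = int(np.clip(round(y), 0, h - 1))
        # Clamp the square window to the image bounds.
        r0, r1 = max(row - r, 0), min(row + r + 1, h)
        c0, c1 = max(col - r, 0), min(col + r + 1, w)
        patch = depth_map[r0:r1, c0:c1]
        out[j] = (x, y, c, depth_map[row, col], patch.min(), patch.max())
    return out
```

Under this sketch, a lifter that previously consumed a (J, 2) array of keypoints would instead consume the (J, 6) array returned above, which is the sense in which only the input layer needs to be widened.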
