ESCAPE: Energy-based Selective Adaptive Correction for Out-of-distribution 3D Human Pose Estimation

Luke Bidulka; Mohsen Gholami; Jiannan Zheng; Martin J. McKeown; Z. Jane Wang

TL;DR

ESCAPE addresses the generalization gap of 3D human pose estimation (HPE) on out-of-distribution (OOD) data with a lightweight, energy-based selective test-time adaptation (TTA) framework. A free-energy-based OOD detector routes each sample: in-distribution (ID) samples receive a fast forward-pass correction of distal keypoint errors from a correction network (CNet), while OOD samples trigger a self-consistency adaptation of that network using a second, reverse network (RCNet). The method improves distal MPJPE by up to 7% and achieves state-of-the-art results on 3DPW and 3DHP while being significantly faster than prior TTA approaches, since restoration of backbone parameters is avoided and adaptation is confined to a small external network. Across multiple backbone models and datasets, ESCAPE improves performance consistently, and ablations support the value of energy-based sample selection and of the proposed correction and adaptation scheme. The approach offers a practical path to deploying accurate 3D HPE in the wild by balancing accuracy gains against inference cost.
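The routing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function and argument names (`escape_predict`, `correct`, `adapt`) are hypothetical, the energy score is taken as already computed, a higher energy is assumed to indicate OOD, and the adapted corrector is assumed to be used for the current sample only. None of these details are confirmed by the summary.

```python
def escape_predict(pose, energy, threshold, correct, adapt):
    """Route one sample through correction or adaptation (illustrative sketch).

    pose      -- backbone prediction for the sample
    energy    -- precomputed energy score of the sample
    threshold -- predetermined energy threshold
    correct   -- callable applying the correction network to a pose
    adapt     -- callable returning a corrector adapted to this sample
    All names are hypothetical stand-ins, not the authors' API.
    """
    if energy > threshold:
        # Assumed convention: high energy means out-of-distribution,
        # so the costly adaptation step runs before correcting.
        adapted = adapt(correct, pose)
        return adapted(pose), "ood"
    # In-distribution: a single forward pass of the corrector.
    return correct(pose), "id"


# Toy stand-ins so the sketch runs: a "pose" is a list of integers.
def toy_correct(pose):
    return [v + 1 for v in pose]


def toy_adapt(correct, pose):
    # Returns a new corrector and leaves the original untouched.
    return lambda p: [2 * v for v in correct(p)]


id_result = escape_predict([1, 2], energy=-8.0, threshold=-5.0,
                           correct=toy_correct, adapt=toy_adapt)
ood_result = escape_predict([1, 2], energy=-2.0, threshold=-5.0,
                            correct=toy_correct, adapt=toy_adapt)
```

In this sketch the backbone never appears inside the adaptation branch, which mirrors the claim that adaptation is confined to a small external network.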

Abstract

Despite recent advances in human pose estimation (HPE), poor generalization to out-of-distribution (OOD) data remains a difficult problem. Previous works have proposed Test-Time Adaptation (TTA) to bridge the train-test domain gap by refining network parameters at inference, but the absence of ground-truth annotations makes this highly challenging, and existing methods typically increase inference times by one or more orders of magnitude. We observe that 1) not every test-time sample is OOD, and 2) HPE errors are significantly larger on distal keypoints (wrist, ankle). Building on these observations, we propose ESCAPE: a lightweight correction and selective adaptation framework which applies a fast, forward-pass correction on most data while reserving costly TTA for OOD data. The free energy function is introduced to separate OOD samples from incoming data, and a correction network is trained to estimate the errors of pretrained backbone HPE predictions on the distal keypoints. For OOD samples, we propose a novel self-consistency adaptation loss to update the correction network by leveraging the constraining relationship between distal keypoints and proximal keypoints (shoulders, hips), via a second "reverse" network. ESCAPE improves the distal MPJPE of five popular HPE models by up to 7% on unseen data, achieves state-of-the-art results on two popular HPE benchmarks, and is significantly faster than existing adaptation methods.
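For readers unfamiliar with energy-based OOD detection, a minimal sketch of the commonly used free-energy score follows. It assumes the score is computed as E = -T log Σ_i exp(s_i / T) over a vector of raw model scores s, and that a higher energy indicates OOD. The abstract does not say which backbone outputs the energy is computed from, so the input vector, the temperature `temperature`, and the function names are assumptions for illustration.

```python
import math


def free_energy(scores, temperature=1.0):
    """Free energy E = -T * log(sum_i exp(s_i / T)), computed stably.

    `scores` is an assumed vector of raw model outputs; which outputs
    the paper uses is not stated in the abstract.
    """
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def is_ood(scores, threshold, temperature=1.0):
    """Flag a sample as OOD when its energy exceeds the threshold
    (assumed convention: confident predictions give low energy)."""
    return free_energy(scores, temperature) > threshold


confident = [10.0, 0.0]   # one dominant score -> low energy
uncertain = [0.0, 0.0]    # flat scores -> higher energy
flags = (is_ood(confident, threshold=-5.0), is_ood(uncertain, threshold=-5.0))
```

The threshold here is arbitrary. In practice it would be chosen ahead of time, for example from energy scores on held-out in-distribution data.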

Paper Structure (17 sections, 10 equations, 8 figures, 10 tables)

This paper contains 17 sections, 10 equations, 8 figures, 10 tables.

Introduction
Related Work
Method
Problem Formulation
Energy-based Sample Selection
Correction Network
Test-Time Adaptation via Self-Consistency
Experiments
Experimental Setup
Datasets
Evaluation Metrics
Quantitative Results
Qualitative Results
Ablation Studies
Conclusions and Future Work
...and 2 more sections

Figures (8)

Figure 1: High-level illustration of ESCAPE, which improves backbone model predictions by separating harder, out-of-distribution samples from easier, in-distribution samples via an energy function (Sec. "Energy-based Sample Selection") and applying intensive test-time adaptation (Sec. "Test-Time Adaptation via Self-Consistency") or fast forward-pass correction (Sec. "Correction Network"), respectively.
Figure 2: Detailed overview of ESCAPE, the proposed selective adaptation and correction framework for 3D human pose correction. Given an input sample (I), a pre-trained backbone human pose estimator (P) predicts an initial pose (X), and the input sample is classified as in-distribution (ID) or out-of-distribution (OOD) for the backbone model by comparing its energy score to a predetermined threshold. If the sample is an easier ID sample, only the fast forward-pass distal joint correction of $\mathcal{C}$ is applied to produce the final improved pose. If instead the sample is a harder OOD sample, intensive test-time adaptation (TT-Adaptation) is used to fine-tune $\mathcal{C}$ to the current sample, and the distal correction from the adapted $\mathcal{C}$ then produces the final corrected pose.
Figure 3: Diagram of the residual network architecture used for $\mathcal{C}$ and $\mathcal{R}$. It consists of an embedding module, a series of N residual blocks, and an output linear layer. The main building block is a linear layer followed by batch norm, ReLU activation, and dropout. The embedding consists of one such building block, while each residual block consists of two of these blocks in series, wrapped by a residual connection.
Figure 4: Strong correlation between the proposed self-consistency loss and the ground-truth 3D prediction error of CNet on the 3DPW dataset, with CLIFF as the backbone pose estimator. Each blue point represents an individual test image, while binned averages are plotted in red.
Figure 5: Example ESCAPE corrections to backbone predictions on samples from 3DPW. The top row shows images input to the backbone estimator and the bottom row shows the GT, backbone predicted, and corrected backbone predicted 3D poses.
...and 3 more figures
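The architecture described in the Figure 3 caption can be sketched as an inference-time forward pass. This is a hedged illustration, not the authors' implementation: it uses NumPy instead of a deep learning framework, treats dropout as the identity and batch norm as a fixed affine normalization (their usual inference behavior), and all sizes (42 inputs, 64 hidden units, 12 outputs, 2 residual blocks) and function names are hypothetical.

```python
import numpy as np


def make_block(rng, d_in, d_out):
    """Parameters of one building block: linear layer + batch norm.
    Dropout has no parameters and is the identity at inference."""
    return {
        "W": rng.normal(0.0, 0.1, size=(d_in, d_out)),
        "b": np.zeros(d_out),
        "mean": np.zeros(d_out),   # batch-norm running mean
        "var": np.ones(d_out),     # batch-norm running variance
        "gamma": np.ones(d_out),   # batch-norm scale
        "beta": np.zeros(d_out),   # batch-norm shift
    }


def block_forward(p, x, eps=1e-5):
    """Linear -> batch norm (running statistics) -> ReLU."""
    h = x @ p["W"] + p["b"]
    h = p["gamma"] * (h - p["mean"]) / np.sqrt(p["var"] + eps) + p["beta"]
    return np.maximum(h, 0.0)


def init_network(rng, d_in, d_hidden, d_out, n_res_blocks):
    """Embedding block, N residual blocks of two building blocks each,
    and an output linear layer."""
    return {
        "embed": make_block(rng, d_in, d_hidden),
        "res": [(make_block(rng, d_hidden, d_hidden),
                 make_block(rng, d_hidden, d_hidden))
                for _ in range(n_res_blocks)],
        "W_out": rng.normal(0.0, 0.1, size=(d_hidden, d_out)),
        "b_out": np.zeros(d_out),
    }


def forward(net, x):
    h = block_forward(net["embed"], x)
    for first, second in net["res"]:
        # Two building blocks in series, wrapped by a residual connection.
        h = h + block_forward(second, block_forward(first, h))
    return h @ net["W_out"] + net["b_out"]


rng = np.random.default_rng(0)
net = init_network(rng, d_in=42, d_hidden=64, d_out=12, n_res_blocks=2)
out = forward(net, rng.normal(size=(3, 42)))  # batch of 3 hypothetical poses
```

The residual connection requires the hidden width to stay constant across residual blocks, which is why only the embedding and output layers change dimensionality in this sketch.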
