Improving the Robustness of 3D Human Pose Estimation: A Benchmark and Learning from Noisy Input

Trung-Hieu Hoang, Mona Zehni, Huy Phan, Duc Minh Vo, Minh N. Do

TL;DR

The paper addresses the fragility of video-based 2D-to-3D human pose lifters under real-world visual corruptions. It introduces two robustness benchmarks, Human3.6M-C and HumanEva-I-C, and demonstrates that state-of-the-art lifters degrade significantly when their input videos are corrupted. To boost resilience, it proposes Temporal Additive Gaussian Noise (TAGN) as a 2D-pose augmentation and Confidence-aware Convolution (CA-Conv) to exploit detector confidence scores during lifting. Across multiple lifters and corruption types, TAGN and CA-Conv consistently improve robustness, offering new baselines and practical approaches for reliable 3D human pose estimation (HPE) in the wild.

Abstract

Despite the promising performance of current 3D human pose estimation techniques, understanding and enhancing their generalization on challenging in-the-wild videos remain an open problem. In this work, we focus on the robustness of 2D-to-3D pose lifters. To this end, we develop two benchmark datasets, namely Human3.6M-C and HumanEva-I-C, to examine the robustness of video-based 3D pose lifters to a wide range of common video corruptions including temporary occlusion, motion blur, and pixel-level noise. We observe the poor generalization of state-of-the-art 3D pose lifters in the presence of corruption and establish two techniques to tackle this issue. First, we introduce Temporal Additive Gaussian Noise (TAGN) as a simple yet effective 2D input pose data augmentation. Additionally, to incorporate the confidence scores output by the 2D pose detectors, we design a confidence-aware convolution (CA-Conv) block. Extensively tested on corrupted videos, the proposed strategies consistently boost the robustness of 3D pose lifters and serve as new baselines for future research.
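
To make the augmentation idea concrete, the following is a minimal sketch of additive Gaussian jitter applied to a sequence of detected 2D poses. It is an illustration only, not the authors' implementation: the function name, the `(T, J, 2)` array layout, the default `sigma`, and the choice of independent noise per frame and per joint are all assumptions, since the abstract does not specify how TAGN samples or scales its noise.

```python
import numpy as np


def temporal_additive_gaussian_noise(poses_2d, sigma=0.01, rng=None):
    """Add zero-mean Gaussian jitter to a 2D pose sequence.

    poses_2d: array of shape (T, J, 2), i.e. T frames, J joints, (x, y).
    sigma:    noise standard deviation in the same units as the poses
              (hypothetical default; the paper's value is not given here).
    rng:      optional numpy Generator for reproducibility.
    """
    rng = np.random.default_rng() if rng is None else rng
    poses_2d = np.asarray(poses_2d, dtype=np.float64)
    # Assumption: noise is drawn independently for every frame, joint,
    # and coordinate, so the jitter varies over time.
    noise = rng.normal(loc=0.0, scale=sigma, size=poses_2d.shape)
    return poses_2d + noise


# Illustrative sizes only: 243 frames and 17 joints.
rng = np.random.default_rng(0)
clean = np.zeros((243, 17, 2))
noisy = temporal_additive_gaussian_noise(clean, sigma=0.02, rng=rng)
```

In a training loop, a function like this would be applied to the clean detected 2D poses before they are passed to the lifter, while the 3D targets stay unchanged.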

Paper Structure (14 sections, 1 equation, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Illustration of a 2D-to-3D pose lifter operating on 2D poses detected from corrupted video frames (red boxes). Our proposed Temporal Additive Gaussian Noise (TAGN) serves as a 2D pose augmentation that adds jitter to the detected 2D poses of the original video frames (blue box). TAGN's goal is to improve generalization on test videos with unforeseen visual corruptions.
  • Figure 2: 3D human pose estimation in a 2D-to-3D pose lifting pipeline. The 2D pose detected by $g_\phi$ is lifted to 3D by $f_\theta$.
  • Figure 3: Histogram of the $\ell_2$ error of several 2D keypoints detected by HRNet \cite{sun2019_hrnet}, after applying guided patch erasing (top) and Gaussian noise (bottom) video corruptions defined in Sec. \ref{sec:vd_corruption}.
  • Figure 4: Regular convolution block (left) versus our proposed confidence-aware convolution block (right).
  • Figure 5: Distribution of the $\ell_2$ error of the left shoulder's detected 2D pose, before and after applying guided patch erasing, for different confidence scores $c$ (left). Joint histogram of the error in the detected 2D pose and the confidence score (right). In the right subplot, the color specifies the density. Best viewed in color.
  • ...and 4 more figures