SEAL-pose: Enhancing 3D Human Pose Estimation via a Learned Loss for Structural Consistency

Yeonsung Kim, Junggeun Do, Seunguk Do, Sangmin Kim, Jaesik Park, Jay-Yoon Lee

TL;DR

SEAL-pose introduces a trainable loss-net that learns structural plausibility for 3D human pose estimation by operating on a skeleton-aware graph and conditioning on 2D observations. A Graphormer-based loss-net (with an optional MLP variant) is trained in an alternating fashion with the pose-net, using early fusion of 2D and 3D cues and synthetic/hard negatives to enforce global and local pose consistency without hand-crafted priors. Structural metrics such as Limb Symmetry Error (LSE) and Body Segment Length Error (BSLE) quantify plausibility, and across Human3.6M, MPI-INF-3DHP, and H3WB the method yields lower MPJPE and improved plausibility across single- and multi-frame backbones with no test-time overhead. The results suggest that a data-driven, differentiable loss that encodes skeletal structure can serve as a flexible alternative to explicit priors, enabling more anatomically coherent 3D poses in diverse settings and real-world applications.
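
To make the two structural metrics concrete, here is a minimal sketch of how left/right limb symmetry and segment-length errors could be computed from 3D joints. The exact definitions are not given in this summary, so the formulas below (mean absolute bone-length differences) are assumptions, and the joint names, bone list, and coordinates are purely illustrative.

```python
import math

# Hypothetical left/right bone pairs, each bone given as (parent, child).
# This is an illustrative leg-only skeleton, not the paper's joint set.
SYMMETRIC_BONES = [
    (("l_hip", "l_knee"), ("r_hip", "r_knee")),
    (("l_knee", "l_ankle"), ("r_knee", "r_ankle")),
]


def bone_length(pose, bone):
    """Euclidean distance between the two joints of a bone."""
    parent, child = bone
    return math.dist(pose[parent], pose[child])


def limb_symmetry_error(pose):
    """Assumed definition: mean |left length - right length| over paired bones."""
    diffs = [
        abs(bone_length(pose, left) - bone_length(pose, right))
        for left, right in SYMMETRIC_BONES
    ]
    return sum(diffs) / len(diffs)


def segment_length_error(pred, gt):
    """Assumed definition: mean |predicted length - ground-truth length| over bones."""
    bones = [bone for pair in SYMMETRIC_BONES for bone in pair]
    diffs = [abs(bone_length(pred, b) - bone_length(gt, b)) for b in bones]
    return sum(diffs) / len(diffs)


# Toy poses in metres: the ground truth is symmetric, while the
# prediction stretches the right shin by 10 cm.
GT = {
    "l_hip": (0.1, 0.0, 0.0), "l_knee": (0.1, -0.4, 0.0), "l_ankle": (0.1, -0.8, 0.0),
    "r_hip": (-0.1, 0.0, 0.0), "r_knee": (-0.1, -0.4, 0.0), "r_ankle": (-0.1, -0.8, 0.0),
}
PRED = dict(GT, r_ankle=(-0.1, -0.9, 0.0))

lse_gt = limb_symmetry_error(GT)
lse_pred = limb_symmetry_error(PRED)
bsle_pred = segment_length_error(PRED, GT)
```

Both quantities depend on pairs of joints at once, which is why a per-joint loss such as MPJPE cannot penalize them directly: a prediction can have a small average joint error and still produce limbs of unequal length.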

Abstract

3D human pose estimation (HPE) is characterized by intricate local and global dependencies among joints. Conventional supervised losses are limited in capturing these correlations because they treat each joint independently. Previous studies have attempted to promote structural consistency through manually designed priors or rule-based constraints; however, these approaches typically require manual specification and are often non-differentiable, limiting their use as end-to-end training objectives. We propose SEAL-pose, a data-driven framework in which a learnable loss-net trains a pose-net by evaluating structural plausibility. Rather than relying on hand-crafted priors, our joint-graph-based design enables the loss-net to learn complex structural dependencies directly from data. Extensive experiments on three 3D HPE benchmarks with eight backbones show that SEAL-pose reduces per-joint errors and improves pose plausibility compared with the corresponding backbones across all settings. Beyond improving each backbone, SEAL-pose also outperforms models with explicit structural constraints, despite not enforcing any such constraints. Finally, we analyze the relationship between the loss-net and structural consistency, and evaluate SEAL-pose in cross-dataset and in-the-wild settings.

Paper Structure (40 sections, 13 equations, 11 figures, 9 tables, 1 algorithm)

Figures (11)

  • Figure 1: SEAL-pose improves structural consistency by preserving skeletal symmetry and kinematic connectivity; under the same 2D input, KTPFormer (baseline) can exhibit local joint inconsistencies (e.g., ankle/wrist) that propagate to the overall limb configuration.
  • Figure 2: (a) Overview of SEAL-pose. SEAL-pose combines a pose-net $F_\phi$, which lifts 2D keypoints $x_i$ to predicted 3D keypoints $\tilde{y}_i$, with a loss-net $E_\theta$ (graph-based or MLP) that predicts an energy score; $y_i$ denotes the ground-truth 3D pose. Training alternates between two steps: first, $E_\theta$ is frozen and $F_\phi$ is optimized to minimize the energy of its predicted poses; then $F_\phi$ is frozen and $E_\theta$ is trained so that its energy better reflects pose quality. (b-c) Loss-net Architecture Variants.
  • Figure 3: Qualitative Comparison of Predicted Poses on H36M. Predictions from SEAL-pose (bottom, blue) demonstrate clear improvements over the baseline (top, red) by producing structures closer to the ground-truth human pose (black).
  • Figure 4: Comparison of structural consistency on MPI-INF-3DHP. Average structural inconsistency measures (from left to right: LSE, BSLE, LLE) are shown for predictions of the baselines (red) and SEAL-pose (blue), binned by P-MPJPE. SEAL-pose consistently achieves lower structural errors even at similar P-MPJPE.
  • Figure 5: Gradient-Based Inference results on MPI-INF-3DHP. P-MPJPE, LSE, LLE, and BSLE all decrease steadily over iterations, indicating that the loss-net effectively captures structural plausibility and provides meaningful corrective feedback to the pose-net.
  • ...and 6 more figures
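
The alternating optimization described in the Figure 2 caption can be sketched in PyTorch as follows. This is a toy illustration under stated assumptions, not the authors' implementation: both networks are small MLP stand-ins (the paper uses real lifting backbones and a Graphormer or MLP loss-net), the joint count `J = 17`, the energy weight `0.1`, and the margin ranking objective for the loss-net are all assumed, and random tensors replace a real data loader.

```python
import torch
from torch import nn

torch.manual_seed(0)
J = 17  # assumed joint count (Human3.6M-style skeleton)

# Stand-in networks; architectures here are illustrative only.
pose_net = nn.Sequential(nn.Linear(J * 2, 64), nn.ReLU(), nn.Linear(64, J * 3))  # F_phi
loss_net = nn.Sequential(nn.Linear(J * 5, 64), nn.ReLU(), nn.Linear(64, 1))      # E_theta

opt_pose = torch.optim.Adam(pose_net.parameters(), lr=1e-3)
opt_loss = torch.optim.Adam(loss_net.parameters(), lr=1e-3)


def energy(x2d, y3d):
    """Early fusion: score the concatenated 2D input and 3D pose."""
    return loss_net(torch.cat([x2d, y3d], dim=-1)).squeeze(-1)


def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad_(flag)


def pose_net_step(x2d, y3d, weight=0.1):
    """Freeze E_theta; update F_phi on supervised error plus predicted energy."""
    set_trainable(loss_net, False)
    pred = pose_net(x2d)
    # Gradients still flow through the frozen loss-net into the prediction.
    loss = nn.functional.mse_loss(pred, y3d) + weight * energy(x2d, pred).mean()
    opt_pose.zero_grad()
    loss.backward()
    opt_pose.step()
    set_trainable(loss_net, True)
    return loss.item()


def loss_net_step(x2d, y3d, margin=1.0):
    """Freeze F_phi; update E_theta so ground truth gets lower energy."""
    with torch.no_grad():  # no gradient reaches the pose-net
        pred = pose_net(x2d)
    # Assumed objective: hinge on E(x, y_gt) + margin <= E(x, y_pred).
    loss = torch.relu(margin + energy(x2d, y3d) - energy(x2d, pred)).mean()
    opt_loss.zero_grad()
    loss.backward()
    opt_loss.step()
    return loss.item()


# Random tensors stand in for batches of 2D keypoints and 3D poses.
x2d, y3d = torch.randn(8, J * 2), torch.randn(8, J * 3)
for _ in range(3):
    pose_loss = pose_net_step(x2d, y3d)
    net_loss = loss_net_step(x2d, y3d)
```

Because the loss-net is used only to shape gradients during training, inference runs the pose-net alone, which is consistent with the claim of no test-time overhead.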