SEAL-pose: Enhancing 3D Human Pose Estimation via a Learned Loss for Structural Consistency
Yeonsung Kim, Junggeun Do, Seunguk Do, Sangmin Kim, Jaesik Park, Jay-Yoon Lee
TL;DR
SEAL-pose introduces a trainable loss-net that learns structural plausibility for 3D human pose estimation by operating on a skeleton-aware graph and conditioning on 2D observations. The loss-net is Graphormer-based (with an optional MLP variant) and is trained in alternation with the pose-net. It uses early fusion of 2D and 3D cues together with synthetic/hard negatives to enforce global and local pose consistency without hand-crafted priors. Two structural metrics, Limb Symmetry Error (LSE) and Body Segment Length Error (BSLE), quantify plausibility. On Human3.6M, MPI-INF-3DHP, and H3WB, the method yields lower MPJPE and improved plausibility for both single- and multi-frame backbones, with no test-time overhead. The results suggest that a data-driven, differentiable loss that encodes skeletal structure can serve as a flexible alternative to explicit priors, enabling more anatomically coherent 3D poses in diverse settings and real-world applications.
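The two structural metrics named above can be made concrete with a small sketch. This summary does not give their exact definitions, so the formulas below are assumptions: Limb Symmetry Error is taken as the mean absolute length difference between paired left and right limbs, and Body Segment Length Error as the mean absolute difference between predicted and ground-truth segment lengths. The joint layout, `BONES`, and `SYMMETRIC_PAIRS` are hypothetical and describe only a toy six-joint skeleton, not any benchmark's joint set.

```python
import math

# Hypothetical toy skeleton (assumption, not the paper's joint set):
# 0 l_shoulder, 1 l_elbow, 2 l_wrist, 3 r_shoulder, 4 r_elbow, 5 r_wrist

# Bones as (parent, child) joint-index pairs.
BONES = [(0, 1), (1, 2), (3, 4), (4, 5)]

# Pairs of indices into BONES that are left/right counterparts.
SYMMETRIC_PAIRS = [(0, 2), (1, 3)]


def bone_lengths(pose):
    """Euclidean length of every bone in a pose given as a list of (x, y, z)."""
    return [math.dist(pose[a], pose[b]) for a, b in BONES]


def limb_symmetry_error(pose):
    """Assumed definition: mean |left length - right length| over paired limbs."""
    lengths = bone_lengths(pose)
    diffs = [abs(lengths[i] - lengths[j]) for i, j in SYMMETRIC_PAIRS]
    return sum(diffs) / len(diffs)


def body_segment_length_error(pred, gt):
    """Assumed definition: mean |predicted length - true length| over all bones."""
    diffs = [abs(p - g) for p, g in zip(bone_lengths(pred), bone_lengths(gt))]
    return sum(diffs) / len(diffs)


# Symmetric ground-truth arms (units are arbitrary).
gt = [(-0.2, 0.0, 0.0), (-0.5, 0.0, 0.0), (-0.75, 0.0, 0.0),
      (0.2, 0.0, 0.0), (0.5, 0.0, 0.0), (0.75, 0.0, 0.0)]

# Prediction whose right wrist lies 0.1 too far along the arm.
pred = list(gt)
pred[5] = (0.85, 0.0, 0.0)

lse_gt = limb_symmetry_error(gt)
lse_pred = limb_symmetry_error(pred)
bsle_pred = body_segment_length_error(pred, gt)
```

Under these assumed definitions, the symmetric ground truth has zero symmetry error, while the prediction with one stretched forearm has a nonzero value for both metrics. The first metric needs only the predicted pose; the second also needs ground-truth segment lengths.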
Abstract
3D human pose estimation (HPE) is characterized by intricate local and global dependencies among joints. Conventional supervised losses are limited in capturing these correlations because they treat each joint independently. Previous studies have attempted to promote structural consistency through manually designed priors or rule-based constraints; however, these approaches typically require manual specification and are often non-differentiable, limiting their use as end-to-end training objectives. We propose SEAL-pose, a data-driven framework in which a learnable loss-net trains a pose-net by evaluating structural plausibility. Rather than relying on hand-crafted priors, our joint-graph-based design enables the loss-net to learn complex structural dependencies directly from data. Extensive experiments on three 3D HPE benchmarks with eight backbones show that SEAL-pose reduces per-joint errors and improves pose plausibility compared with the corresponding backbones across all settings. Beyond improving each backbone, SEAL-pose also outperforms models with explicit structural constraints, despite not enforcing any such constraints. Finally, we analyze the relationship between the loss-net and structural consistency, and evaluate SEAL-pose in cross-dataset and in-the-wild settings.
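To illustrate why a loss that treats each joint independently can miss structure, the sketch below computes MPJPE (the mean Euclidean distance between predicted and ground-truth joints) for two toy predictions of a single two-joint limb. The coordinates and the helper names `mpjpe` and `bone_length` are invented for illustration and are not taken from the paper or any dataset.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance per joint."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def bone_length(pose):
    """Length of the single bone joining joint 0 and joint 1."""
    return math.dist(pose[0], pose[1])


# Ground truth: elbow and wrist, bone length 0.5 (arbitrary units).
gt = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]

# Prediction A: both joints shifted by the same offset, so the bone is intact.
pred_shift = [(0.0, 0.25, 0.0), (0.5, 0.25, 0.0)]

# Prediction B: joints moved 0.25 in opposite directions along the bone,
# so the bone is stretched to twice its true length.
pred_stretch = [(-0.25, 0.0, 0.0), (0.75, 0.0, 0.0)]

err_shift = mpjpe(pred_shift, gt)
err_stretch = mpjpe(pred_stretch, gt)
len_shift = bone_length(pred_shift)
len_stretch = bone_length(pred_stretch)
```

Both predictions receive the same MPJPE, yet only the first preserves the bone length. A sum of independent per-joint terms cannot separate the two cases, whereas a loss that scores the whole pose jointly can in principle do so.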
