SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

Da Li, Jiping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Xi Shen

TL;DR

SKEL-CF introduces a coarse-to-fine transformer framework for estimating anatomically constrained SKEL parameters from a single image, targeting biomechanical realism in 3D human mesh recovery. By constructing the 4DHuman-SKEL dataset and incorporating an explicit camera model, SKEL-CF achieves state-of-the-art performance among SKEL-based methods and remains competitive with leading SMPL-based approaches, especially on the challenging MOYO dataset. On MOYO it reports 85.0 MPJPE / 51.4 PA-MPJPE against 104.5 / 79.6 for HSMR, the previous SKEL-based state of the art, along with improved visual fidelity of both skeletal and surface reconstructions, supported by ablations and per-layer attention analyses. The work narrows the gap between computer vision and biomechanics with a scalable, anatomically faithful pipeline for motion analysis.

Abstract

Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
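The abstract reports results as MPJPE and PA-MPJPE without restating how they are computed. The following is a minimal sketch of the standard definitions, assuming predicted and ground-truth joints are given as (J, 3) NumPy arrays; the function names and the array layout are illustrative assumptions and are not taken from the SKEL-CF code base.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with the best scale, rotation, and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the cross-covariance gives the optimal rotation (Kabsch/Umeyama).
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    signs = np.array([1.0, 1.0, d])
    rot = u @ np.diag(signs) @ vt          # row-vector convention: p @ rot ~ g
    scale = (s * signs).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes (similarity) alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)

# Hypothetical usage with random joints: a rotated, scaled, shifted copy of gt
# has a large MPJPE but a PA-MPJPE close to zero.
rng = np.random.default_rng(0)
gt = rng.normal(size=(10, 3))
c, s_ = np.cos(0.7), np.sin(0.7)
rot_z = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
pred = 1.7 * gt @ rot_z + np.array([0.3, -0.2, 1.0])
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```

Because PA-MPJPE removes global scale, rotation, and translation, it isolates articulated pose error, whereas MPJPE also penalizes global misalignment. Applied to the MOYO numbers quoted in the abstract, the reduction relative to HSMR is about 18.7% in MPJPE (104.5 to 85.0) and about 35.4% in PA-MPJPE (79.6 to 51.4).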

Paper Structure

This paper contains 35 sections, 6 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Visualization of SKEL-CF on web images. Compared with the SMPL-based state-of-the-art CameraHMR patel2024camerahmr, our SKEL-based model produces more natural joint motions (see the side-view zoomed knee in blue). Compared with HSMR xia2025hsmr, the state-of-the-art SKEL-based method, SKEL-CF achieves more accurate skeleton and mesh reconstruction (see zoomed hand in orange).
  • Figure 2: Overview of the proposed SKEL-CF. Our method estimates SKEL parameters from a single image using an encoder–decoder architecture that performs coarse-to-fine estimation. The encoder produces initial predictions of the camera extrinsics $\boldsymbol{\pi}$, shape parameters $\boldsymbol{\beta}$, and pose parameters $\boldsymbol{\theta}$. The decoder then refines these predictions progressively across layers. In addition, we adopt the camera model from CameraHMR patel2024camerahmr to estimate the camera intrinsics.
  • Figure 3: Visual comparison between the proposed SKEL-CF and HSMR xia2025hsmr. Our proposed SKEL-CF achieves more precise skeletal estimations (best viewed in zoom). Additional visual results are provided in the supplementary material.
  • Figure 4: Visual comparison between the proposed SKEL-CF and CameraHMR patel2024camerahmr. SKEL-CF is built upon the SKEL representation, which enforces anatomically consistent joint motion, resulting in more natural pose predictions than CameraHMR, which is based on SMPL loper2015smpl (best viewed in zoom). Additional visual examples are provided in the supplementary material.
  • Figure 5: Illustration of the mesh refinement process. The light green meshes represent the initial coarse estimations, while the dark green meshes denote the progressively refined final results.
  • ...and 5 more figures
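The coarse-to-fine scheme described in the captions of Figures 2 and 5 (an encoder produces initial estimates, and decoder layers refine them progressively) can be illustrated with a toy sketch. This is only a schematic under stated assumptions: the additive update rule, the `refine` and `toy_layer` names, and the three-dimensional parameter vector are all hypothetical and are not the paper's architecture, whose decoder layers are learned transformer blocks.

```python
import numpy as np

def refine(coarse, layers, features):
    """Apply each layer as an additive correction; return all intermediate estimates."""
    estimates = [coarse]
    for layer in layers:
        current = estimates[-1]
        estimates.append(current + layer(current, features))
    return estimates

def toy_layer(params, features):
    """Hypothetical stand-in for a decoder layer: step halfway toward a target."""
    return 0.5 * (features - params)

# Assumed toy setup: 'features' directly encodes the target parameters.
target = np.array([1.0, -2.0, 0.5])
steps = refine(np.zeros(3), [toy_layer] * 3, target)
errors = [float(np.linalg.norm(s - target)) for s in steps]
print(errors)  # the error shrinks after every refinement step
```

Keeping every intermediate estimate, rather than only the final one, is what makes it possible to visualize the progression from coarse to refined meshes or to supervise each stage separately.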