HART: Human Aligned Reconstruction Transformer

Xiyi Chen, Shaofei Wang, Marko Mihajlovic, Taewon Kang, Sergey Prokudin, Ming Lin

TL;DR

HART introduces a calibration-free, feed-forward framework for clothed human reconstruction from sparse RGB views by jointly predicting per-pixel point maps, normals, SMPL-X tightness vectors, and body-part labels with a VGGT-style transformer. A 3D-DPSR based occlusion-aware reconstruction plus a 3D-UNet refines an indicator grid to produce a watertight clothed mesh, which is tightly aligned to an underlying SMPL-X body via marker-based fitting. Geometry-informed novel view synthesis is achieved by initializing Gaussian surfels on the clothed surface and optimizing them under photometric, depth, normal, and structural losses, enabling high-fidelity rendering from sparse views. Trained on 2.3K synthetic THuman scans, HART achieves state-of-the-art results in clothed-mesh reconstruction, SMPL-X estimation, and novel-view synthesis, demonstrating strong generalization to real-world clothing and interactions and paving the way for scalable human reconstruction in practical settings.

Abstract

We introduce HART, a unified framework for sparse-view human reconstruction. Given a small set of uncalibrated RGB images of a person as input, it outputs a watertight clothed mesh, the aligned SMPL-X body mesh, and a Gaussian-splat representation for photorealistic novel-view rendering. Prior methods for clothed human reconstruction either optimize parametric templates, which overlook loose garments and human-object interactions, or train implicit functions under simplified camera assumptions, limiting applicability in real scenes. In contrast, HART predicts per-pixel 3D point maps, normals, and body correspondences, and employs an occlusion-aware Poisson reconstruction to recover complete geometry, even in self-occluded regions. These predictions also align with a parametric SMPL-X body model, ensuring that reconstructed geometry remains consistent with human structure while capturing loose clothing and interactions. These human-aligned meshes initialize Gaussian splats to further enable sparse-view rendering. Although trained on only 2.3K synthetic scans, HART achieves state-of-the-art results on a wide range of datasets: Chamfer Distance improves by 18-23 percent for clothed-mesh reconstruction, PA-V2V drops by 6-27 percent for SMPL-X estimation, and LPIPS decreases by 15-27 percent for novel-view synthesis. These results suggest that feed-forward transformers can serve as a scalable model for robust human reconstruction in real-world settings. Code and models will be released.
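
To make the reported mesh metric concrete, the following is a minimal sketch of a symmetric Chamfer Distance between two point sets sampled from meshes, together with the relative-improvement arithmetic behind a statement such as "improves by 18-23 percent". The exact variant used in the paper (squared vs. unsquared distances, sum vs. mean, number of sampled points) is not stated in this summary, so the unsquared mean-of-means form below is an assumption, and the function names `chamfer_distance` and `relative_improvement` are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def nearest_distance(p: Point, cloud: Sequence[Point]) -> float:
    """Euclidean distance from p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance (assumed variant: average of the two
    directed mean nearest-neighbour distances, unsquared)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_distance(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage drop of a lower-is-better metric relative to a baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - ours) / baseline
```

This brute-force version is quadratic in the number of points and is only meant for illustration; a practical evaluation would typically use a spatial index such as a k-d tree for the nearest-neighbour queries.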

Paper Structure

This paper contains 38 sections, 13 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Given (a) uncalibrated, sparse-view human images, our method HART is a unified framework that simultaneously reconstructs (b) the underlying SMPL-X body mesh and (c) the clothed mesh. (d) Our clothed mesh prediction serves as an initialization and regularization to further enable novel view synthesis from sparse views.
  • Figure 2: Overview of our Network Architecture. Given $N$ uncalibrated human images, our HART transformer first maps input images $\{ I_i \}_{i=1}^N$ into per-pixel point maps $\hat{p}_i$, refined normal maps $\hat{\mathbf{n}}_i$, SMPL-X tightness vectors $\hat{\mathbf{v}}_i$, and body part labels $\hat{l}_i$. The oriented point maps $\hat{p}_i, \hat{\mathbf{n}}_i$ for all views are merged and converted to an indicator grid $\chi_{\mathrm{refined}}$ via Differentiable Poisson Surface Reconstruction (DPSR). A 3D-UNet $g_{\theta}$ refines the grid to account for self-occlusions, and a clothed mesh reconstruction $\mathbf{M}_{\mathrm{clothed}}$ is obtained by running marching cubes. The SMPL-X tightness vectors and label maps are aggregated into body markers $\hat{\mathbf{m}}$, to which an SMPL-X mesh $\mathbf{M}_{\mathrm{SMPL\text{-}X}}$ is fitted by optimization.
  • Figure 3: Clothed Mesh Reconstruction from 4 views. We show 1 subject from THuman 2.1 (row 1) and 2 from 2K2K test sets (rows 2–3). In contrast to various baselines, our method can recover detailed geometry in both observed and occluded regions.
  • Figure 4: SMPL-X Mesh Reconstruction from 4 Views: 2 subjects from THuman (left) and 2 from 2K2K test sets (right). Keypoint-based EasyMocap and MV-SMPLify-X produce inaccurate head poses and body shapes, while ETCH often misstitches reconstructed feet/hands.
  • Figure 5: Novel View Synthesis from 4 Views. We show qualitative results for novel view synthesis on the DNA-Rendering test set. Benefiting from our accurate reconstruction, we achieve photorealistic rendering while avoiding issues present in baselines, including overly smooth appearance (LaRa), hallucinated textures (SEVA), and floater artifacts (MAtCha). Please refer to the appendix for more qualitative results.
  • ...and 4 more figures
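
The Figure 2 caption says that tightness vectors and label maps are aggregated into body markers. As a rough illustration only, the sketch below assumes that each tightness vector $\hat{\mathbf{v}}$ displaces its surface point $\hat{p}$ onto the underlying body, and that one marker per body-part label is the plain average of those displaced points. The averaging rule, the absence of confidence weighting, and the name `aggregate_markers` are all assumptions, not details confirmed by this summary.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def aggregate_markers(
    points: Sequence[Vec3],
    tightness: Sequence[Vec3],
    labels: Sequence[int],
) -> Dict[int, Vec3]:
    """Average the displaced points (p + v) per body-part label.

    Assumption: a marker is the unweighted mean of all displaced points
    that share a label; the actual aggregation may differ.
    """
    if not (len(points) == len(tightness) == len(labels)):
        raise ValueError("points, tightness and labels must have equal length")

    sums: Dict[int, List[float]] = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts: Dict[int, int] = defaultdict(int)

    for p, v, label in zip(points, tightness, labels):
        for axis in range(3):
            # Move the clothed-surface point along its tightness vector.
            sums[label][axis] += p[axis] + v[axis]
        counts[label] += 1

    return {
        label: (s[0] / counts[label], s[1] / counts[label], s[2] / counts[label])
        for label, s in sums.items()
    }
```

In such a setup, the resulting markers would serve as sparse 3D targets for the SMPL-X fitting step; the fitting itself is outside the scope of this sketch.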