HART: Human Aligned Reconstruction Transformer
Xiyi Chen, Shaofei Wang, Marko Mihajlovic, Taewon Kang, Sergey Prokudin, Ming Lin
TL;DR
HART is a calibration-free, feed-forward framework for clothed human reconstruction from sparse RGB views. A VGGT-style transformer jointly predicts per-pixel point maps, normals, SMPL-X tightness vectors, and body-part labels. An occlusion-aware reconstruction based on 3D-DPSR, followed by a 3D U-Net that refines the indicator grid, produces a watertight clothed mesh, which is tightly aligned to an underlying SMPL-X body via marker-based fitting. For geometry-informed novel-view synthesis, Gaussian surfels are initialized on the clothed surface and optimized under photometric, depth, normal, and structural losses, enabling high-fidelity rendering from sparse views. Trained on 2.3K synthetic THuman scans, HART achieves state-of-the-art results in clothed-mesh reconstruction, SMPL-X estimation, and novel-view synthesis, and generalizes well to real-world clothing and interactions.
Abstract
We introduce HART, a unified framework for sparse-view human reconstruction. Given a small set of uncalibrated RGB images of a person as input, it outputs a watertight clothed mesh, the aligned SMPL-X body mesh, and a Gaussian-splat representation for photorealistic novel-view rendering. Prior methods for clothed human reconstruction either optimize parametric templates, which overlook loose garments and human-object interactions, or train implicit functions under simplified camera assumptions, limiting applicability in real scenes. In contrast, HART predicts per-pixel 3D point maps, normals, and body correspondences, and employs an occlusion-aware Poisson reconstruction to recover complete geometry, even in self-occluded regions. These predictions also align with a parametric SMPL-X body model, ensuring that the reconstructed geometry remains consistent with human structure while capturing loose clothing and interactions. The resulting human-aligned meshes initialize Gaussian splats to further enable sparse-view rendering. Although trained on only 2.3K synthetic scans, HART achieves state-of-the-art results across a wide range of datasets: Chamfer Distance improves by 18-23 percent for clothed-mesh reconstruction, PA-V2V drops by 6-27 percent for SMPL-X estimation, and LPIPS decreases by 15-27 percent for novel-view synthesis. These results suggest that feed-forward transformers can serve as a scalable model for robust human reconstruction in real-world settings. Code and models will be released.
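For readers unfamiliar with two of the metrics named above, the following is a minimal sketch of how Chamfer Distance and PA-V2V (Procrustes-aligned vertex-to-vertex error) are commonly computed. It is not taken from the HART code: the function names, the unsquared symmetric Chamfer form, the brute-force nearest-neighbour search, and the similarity (scale, rotation, translation) alignment are all assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).

    Assumed convention: unsquared distances, averaged over both directions.
    """
    # Pairwise Euclidean distances, shape (N, M). Brute force, so only
    # suitable for small point sets.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbour distance a->b and b->a, then averaged.
    return float(0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()))


def pa_v2v(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-vertex error after similarity Procrustes alignment.

    pred and gt are (V, 3) arrays with one-to-one vertex correspondence,
    as is the case for two SMPL-X meshes.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a proper rotation.
    sign = np.sign(np.linalg.det(u @ vt))
    fix = np.diag([1.0, 1.0, sign])
    rot = u @ fix @ vt  # chosen so that p @ rot approximates g
    scale = (s * np.diag(fix)).sum() / (p ** 2).sum()
    aligned = scale * p @ rot + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```

Because PA-V2V removes global scale, rotation, and translation before measuring error, it isolates pose and shape accuracy from camera or placement error, whereas Chamfer Distance needs no vertex correspondence and therefore suits free-form clothed meshes.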
