DiMeR: Disentangled Mesh Reconstruction Model
Lutao Jiang, Jiantao Lin, Kanghao Chen, Wenhang Ge, Xin Yang, Yifan Jiang, Yuanhuiyi Lyu, Xu Zheng, Yinchuan Li, Yingcong Chen
TL;DR
DiMeR tackles geometry-texture ambiguity in mesh reconstruction by disentangling geometry and texture into two specialized branches trained with 3D supervision. The geometry branch takes normal maps as input and enforces 3D consistency with an eikonal loss, ground-truth (GT) SDF supervision, and PBR-based lighting; the texture branch derives appearance from RGB images via a texture field. Mesh extraction is streamlined to support higher resolution. The approach leverages contemporary normal-prediction foundation models and 2.5D diffusion to obtain multi-view inputs, enabling sparse-view, single-image, and text-to-3D tasks. Empirically, DiMeR reduces Chamfer Distance by more than 30% on GSO and OmniObject3D compared with baselines and performs robustly across input modalities and tasks.
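For readers unfamiliar with the eikonal loss mentioned above, the following is a minimal sketch of the general idea, not the authors' implementation. A valid signed distance function has a gradient of unit norm everywhere, and the loss penalizes deviations from that. The analytic sphere SDF, the finite-difference gradient, the sample points, and all function names are assumptions made for illustration; a learned model would typically use automatic differentiation instead.

```python
import math

def sphere_sdf(p, radius=1.0):
    # Analytic SDF of a sphere at the origin (illustrative stand-in for a learned SDF).
    x, y, z = p
    return math.sqrt(x * x + y * y + z * z) - radius

def gradient_norm(sdf, p, h=1e-4):
    # Central finite differences along each axis approximate the SDF gradient.
    grads = []
    for axis in range(3):
        fwd, bwd = list(p), list(p)
        fwd[axis] += h
        bwd[axis] -= h
        grads.append((sdf(fwd) - sdf(bwd)) / (2 * h))
    return math.sqrt(sum(g * g for g in grads))

def eikonal_loss(sdf, points):
    # Mean squared deviation of the gradient norm from 1.
    return sum((gradient_norm(sdf, p) - 1.0) ** 2 for p in points) / len(points)

# Hypothetical sample points, chosen away from the origin where the sphere SDF is smooth.
samples = [(0.5, 0.2, -0.1), (1.5, 0.0, 0.3), (-0.7, 0.9, 0.4)]

valid = eikonal_loss(sphere_sdf, samples)                        # close to 0
scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), samples)    # close to 1
```

A true distance field gives a loss near zero, while a field scaled by 2 has gradient norm 2 everywhere and is penalized accordingly.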
Abstract
We propose DiMeR, a novel geometry-texture disentangled feed-forward model with 3D supervision for sparse-view mesh reconstruction. Existing methods confront two persistent obstacles: (i) textures can conceal geometric errors, i.e., visually plausible images can be rendered even with wrong geometry, producing multiple ambiguous optimization objectives in the mixed geometry-texture solution space for similar objects; and (ii) prevailing mesh extraction methods are redundant, unstable, and lack 3D supervision. To address these challenges, we rethink the inductive bias for mesh reconstruction. First, we disentangle the unified geometry-texture solution space, where a single input admits multiple feasible solutions, into separate geometry and texture spaces. Specifically, because normal maps are strictly consistent with geometry and accurately capture surface variations, they serve as the sole input for geometry prediction in DiMeR, while the texture is estimated from RGB images. Second, we streamline the mesh extraction algorithm by eliminating modules with low performance/cost ratios and redesigning the regularization losses with 3D supervision. Notably, DiMeR still accepts raw RGB images as input by leveraging foundation models for normal prediction. Extensive experiments demonstrate that DiMeR generalises across sparse-view, single-image, and text-to-3D tasks, consistently outperforming baselines. On the GSO and OmniObject3D datasets, DiMeR reduces Chamfer Distance by more than 30%.
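To make the reported metric concrete, here is a minimal sketch of a symmetric Chamfer Distance between two point sets sampled from a predicted and a reference surface. Conventions differ across papers (squared versus unsquared distances, sums versus means), and the abstract does not state which one DiMeR uses, so the squared-distance, mean-based form below is an assumption. The function names and the toy point sets are hypothetical and are not results from the paper.

```python
def chamfer_distance(points_a, points_b):
    # Symmetric Chamfer Distance: for each point, take the squared distance
    # to its nearest neighbor in the other set, average per direction, then sum.
    def one_sided(src, dst):
        total = 0.0
        for p in src:
            total += min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in dst)
        return total / len(src)
    return one_sided(points_a, points_b) + one_sided(points_b, points_a)

def relative_reduction(baseline_cd, method_cd):
    # Fractional improvement of a method over a baseline (0.3 means 30% lower).
    return (baseline_cd - method_cd) / baseline_cd

# Toy point sets for illustration only.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]

cd_same = chamfer_distance(reference, reference)
cd_shifted = chamfer_distance(reference, shifted)
```

Under this convention, a reduction "by more than 30%" means `relative_reduction(baseline_cd, method_cd)` exceeds 0.3 for the compared Chamfer Distance values.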
