Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
Xinyue Liang, Zhiyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang
TL;DR
The paper tackles the gap between geometrically plausible 3D assets and photorealistic appearance by introducing Photo3D, which couples a structure-aligned multi-view synthesis pipeline with a realism-focused detail enhancement scheme driven by images from GPT-4o-Image. It builds Photo3D-MV, a large multi-view dataset paired with 3D geometry, and formulates perceptual adaptation (CLIP) and semantic structure matching (DINOv3) losses to refine appearance while preserving geometry. Paradigm-specific training strategies let Photo3D improve both geometry-texture coupled and decoupled 3D-native generators, achieving state-of-the-art photorealistic 3D generation across benchmarks. The work shows how 2D realism priors can compensate for scarce 3D texture data, enabling more convincing and diverse 3D content. One limitation is residual bias inherited from the image generator, which can be mitigated as image synthesis models evolve.
Abstract
Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving the structural consistency with the 3D-native geometry. Our scheme is general to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.
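As a rough illustration of how a perceptual term and a structural term might be combined into one objective, here is a minimal sketch. It is not the paper's formulation: the cosine-distance form, the weights `w_perc` and `w_struct`, the function names, and the choice of which features are compared are all assumptions, and random arrays stand in for features that would in practice come from perceptual and structural encoders such as CLIP and DINOv3.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows of a and b."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=-1)))


def detail_enhancement_loss(perc_render, perc_target,
                            struct_render, struct_geom,
                            w_perc=1.0, w_struct=1.0):
    """Hypothetical weighted sum of a perceptual and a structural term.

    perc_*   : (N, D1) features standing in for a perceptual encoder,
               computed on rendered views and on detail-enhanced targets.
    struct_* : (N, D2) features standing in for a structural encoder,
               computed on rendered views and on geometry-aligned views.
    """
    l_perc = cosine_distance(perc_render, perc_target)
    l_struct = cosine_distance(struct_render, struct_geom)
    return w_perc * l_perc + w_struct * l_struct, l_perc, l_struct


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder features: 4 views, arbitrary feature widths.
    pr, pt = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
    sr, sg = rng.normal(size=(4, 32)), rng.normal(size=(4, 32))
    total, l_perc, l_struct = detail_enhancement_loss(pr, pt, sr, sg)
    print(f"total={total:.4f} perceptual={l_perc:.4f} structural={l_struct:.4f}")
```

In this sketch the two terms pull in different directions by design: the perceptual term rewards realistic appearance, while the structural term penalizes drift away from the underlying geometry. The relative weights would need tuning for any real generator.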
