Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement

Xinyue Liang, Zhiyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang

TL;DR

The paper tackles the gap between geometrically plausible 3D assets and photorealistic appearance by introducing Photo3D, which couples a structure-aligned multi-view synthesis pipeline with a realism-focused detail enhancement scheme guided by GPT-4o-Image. It builds Photo3D-MV, a large, 3D-annotated multi-view dataset, and formulates perceptual adaptation (CLIP) and semantic structure matching (DINOv3) losses to refine appearance while preserving geometry. Paradigm-specific training strategies enable Photo3D to boost both geometry–texture coupled and decoupled 3D-native generators, achieving state-of-the-art photorealistic 3D generation across benchmarks. The work demonstrates how 2D realism priors can effectively augment limited 3D texture data, enabling more convincing and diverse 3D content. Limitations include residual bias from the image generator, which can be mitigated as image synthesis models evolve.

Abstract

Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving the structural consistency with the 3D-native geometry. Our scheme is general to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.
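
The structure-aligned synthesis pipeline mentioned in the abstract (its stages are listed in the Figure 3 caption) can be read as a simple data flow. The sketch below only illustrates that flow and is not the authors' code: `MVSample`, `build_sample`, the five injected callables, and the `n_views` default are hypothetical names and values, standing in for prompt processing, Flux.1-Dev, Trellis, a renderer, and GPT-4o-Image respectively.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class MVSample:
    """One hypothetical dataset entry: text, 3D asset, refined multi-view images."""
    text: str
    asset: Any
    views: List[Any]


def build_sample(
    raw_prompt: str,
    rewrite: Callable[[str], str],        # stand-in: prompt -> object-centric description
    text_to_image: Callable[[str], Any],  # stand-in for the text-to-image model
    image_to_3d: Callable[[Any], Any],    # stand-in for the image-to-3D generator
    render: Callable[[Any, int], Any],    # stand-in: (asset, view index) -> rendering
    refine: Callable[[Any, str], Any],    # stand-in: (rendering, text) -> realistic image
    n_views: int = 4,                     # assumed view count, not taken from the paper
) -> MVSample:
    text = rewrite(raw_prompt)
    image = text_to_image(text)
    asset = image_to_3d(image)
    # Each rendering is refined separately; the rendering fixes the structure
    # that the refined image is expected to follow.
    renders = [render(asset, i) for i in range(n_views)]
    views = [refine(r, text) for r in renders]
    return MVSample(text=text, asset=asset, views=views)
```

Passing the models in as callables keeps the sketch independent of any particular API; in practice each stage would wrap a real model call.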

Paper Structure

This paper contains 14 sections, 7 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Photorealistic 3D objects in the foreground are generated by Photo3D, featuring rich realistic details and stable 3D structure.
  • Figure 2: Overview of Photo3D. We first construct Photo3D-MV, a realistic, detail-enhanced multi-view dataset paired with 3D geometry, and propose associated schemes to learn realistic 3D appearance details. Paradigm-specific training strategies are designed for geometry–texture coupled and decoupled paradigms: (a) diffusion-based 3D-native generator (e.g., Trellis [xiang2025structured]); (b) single feed-forward 3D-native texturing model (e.g., TexGaussian [xiong2025texgaussian]); and (c) diffusion-based multi-view texturing model (e.g., Step1X-3D [li2025step1x]).
  • Figure 3: The structure-aligned realistic multi-view synthesis pipeline for the Photo3D-MV dataset. We first process text prompts from DiffusionDB [wang2022diffusiondb] to obtain object-centric descriptions with realistic attributes. We then use Flux.1-Dev [flux2024] to generate images, serving as inputs for 3D generation with Trellis [xiang2025structured]. Finally, we employ GPT-4o-Image [hurst2024gpt] to refine the multi-view 3D renderings into structure-aligned, photorealistic images. These realistic multi-views, together with text descriptions and the generated 3D assets, constitute Photo3D-MV.
  • Figure 4: Diverse distribution of top categories in Photo3D-MV.
  • Figure 5: Computation of $\mathcal{L}_{\text{adapt}}$ and $\mathcal{L}_{\text{match}}$ between a synthesized image and the corresponding GT image. (a) Same‑color boxes indicate crops for computing $\mathcal{L}_{\text{adapt}}$. (b) Same‑color patches denote sampled semantically matched patches contributing to $\mathcal{L}_{\text{match}}$.
  • ...and 7 more figures
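
The Figure 5 caption describes two losses: one computed on same-location crops ($\mathcal{L}_{\text{adapt}}$) and one on semantically matched patches ($\mathcal{L}_{\text{match}}$). The sketch below is one plausible reading of that description, not the paper's formulation: the cosine-distance form, the mutual-nearest-neighbor matching rule, and the names `adapt_loss`, `match_loss`, and `encode` are all assumptions. `encode` stands in for a CLIP image encoder, and the patch feature arrays stand in for DINOv3 patch features.

```python
import numpy as np


def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (N, D) and rows of b (M, D)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T


def adapt_loss(pred_img, gt_img, boxes, encode):
    """Assumed L_adapt: mean (1 - cosine) between features of crops taken
    at the same boxes (y0, x0, y1, x1) in the synthesized and GT images."""
    losses = []
    for y0, x0, y1, x1 in boxes:
        f_pred = encode(pred_img[y0:y1, x0:x1])[None, :]
        f_gt = encode(gt_img[y0:y1, x0:x1])[None, :]
        losses.append(1.0 - cosine_sim(f_pred, f_gt)[0, 0])
    return float(np.mean(losses))


def match_loss(pred_feats, gt_feats):
    """Assumed L_match: keep patch pairs that are mutual nearest neighbors
    in feature space, then average (1 - cosine) over those pairs."""
    sim = cosine_sim(pred_feats, gt_feats)
    nn_pred = sim.argmax(axis=1)   # best GT patch for each synthesized patch
    nn_gt = sim.argmax(axis=0)     # best synthesized patch for each GT patch
    idx = np.arange(sim.shape[0])
    mutual = nn_gt[nn_pred] == idx
    if not mutual.any():
        return 0.0
    return float(np.mean(1.0 - sim[idx[mutual], nn_pred[mutual]]))
```

Matching patches by feature similarity rather than by pixel position reflects the caption's "semantically matched" wording: the refined GT image need not be pixel-aligned with the synthesized one. A training version would compute the same quantities on differentiable tensors.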