SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors

Dave Zhenyu Chen; Haoxuan Li; Hsin-Ying Lee; Sergey Tulyakov; Matthias Nießner

TL;DR

SceneTex introduces a diffusion-prior-based texture synthesis framework for indoor scenes that optimizes textures directly in RGB space. It uses a multiresolution texture field to capture details and a cross-attention decoder to enforce global style consistency across instances. Through depth-conditioned diffusion priors and a VSD-based objective, SceneTex achieves superior texture quality and prompt fidelity on 3D-FRONT scenes, outperforming previous methods both quantitatively and in user studies. While shading artifacts remain a limitation, the approach offers a scalable path to high-quality, style-controlled 3D scene texturing.

Abstract

We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distill diffusion latent features without accurate geometric and style cues, SceneTex formulates the texture synthesis task as an optimization problem in the RGB space where style and geometry consistency are properly reflected. At its core, SceneTex proposes a multiresolution texture field to implicitly encode the mesh appearance. We optimize the target texture via a score-distillation-based objective function in the respective RGB renderings. To further secure the style consistency across views, we introduce a cross-attention decoder to predict the RGB values by cross-attending to the pre-sampled reference locations in each instance. SceneTex enables diverse and accurate texture synthesis for 3D-FRONT scenes, demonstrating significant improvements in visual quality and prompt fidelity over prior texture generation methods.

Paper Structure (23 sections, 2 equations, 12 figures, 2 tables)

Introduction
Related work
Method
Multiresolution Texture Field
Cross-attention Texture Decoder
Texture Field Optimization via VSD
Inference
Results
Implementation Details
Quantitative Analysis
Qualitative Results
Ablation Studies
Does texture field produce better textures than RGB tensors?
Does multiresolution texture improve the visual quality?
Does cross-attention strengthen the style consistency?
...and 8 more sections

Figures (12)

Figure 1: We introduce SceneTex, a text-driven texture synthesis architecture for 3D indoor scenes. Given scene geometries and text prompts as input, SceneTex generates high-quality and style-consistent textures via depth-to-image diffusion priors.
Figure 2: Texture synthesis pipeline. The target mesh is first projected to a given viewpoint via a rasterizer liu2019soft. Then, we render an RGB image with the proposed multiresolution texture field module. Specifically, each rasterized UV coordinate is taken as input to sample the UV embeddings from a multiresolution texture. Afterward, the UV embeddings are mapped to an RGB image of shape $768 \times 768 \times 3$ via a cross-attention texture decoder. We use a pre-trained VAE encoder to compress the input RGB image to a $96 \times 96 \times 4$ latent feature. Finally, the Variational Score Distillation loss wang2023prolificdreamer is computed from the latent feature to update the texture field.
Figure 3: Multiresolution Texture. We use a multiresolution feature grid to encode positional features at different scales in the UV space. For a query UV coordinate, we interpolate the grid features at the respective resolutions. The interpolated grid features are concatenated as the final UV embedding for the query UV coordinate.
Figure 4: Cross-attention Texture Decoder. For each rasterized UV coordinate, we apply a UV instance mask to mask out the corresponding instance texture features. Then, we obtain the rendering UV embeddings for the rasterized locations in the view. At the same time, we extract the texture features for the pre-sampled UVs scattered across this instance as the reference UV embeddings. We deploy a multi-head cross-attention module to produce the instance-aware UV embeddings. Here, we treat the rendering UV embeddings as the Query, and the reference UV embeddings as the Key and Value. Finally, a shared MLP maps the instance-aware UV embeddings to RGB values in the rendered view.
Figure 5: Qualitative comparisons. Latent-Paint metzer2022latent suffers from over-saturation and hallucinates scene components. MVDiffusion tang2023mvdiffusion delivers blurry textures and fails to reflect the input prompts. Text2Tex chen2023text2tex struggles to keep all instances style-consistent. In contrast, our method produces high-quality textures and maintains overall style consistency across instances in the scenes. Ceilings and back-facing walls are excluded for better visualization. Images best viewed in color.
...and 7 more figures
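
The captions for Figures 3 and 4 describe two steps that may be easier to follow in code: sampling a UV embedding from feature grids at several resolutions, then letting the rendered embeddings cross-attend to reference embeddings of the same instance before a shared MLP predicts RGB. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The grid resolutions, feature width, random weights, and all function names (`bilinear_sample`, `multires_embed`, `cross_attention`, `decode_rgb`) are hypothetical, and the attention is simplified to a single head without learned projections, whereas the caption describes a multi-head module.

```python
import numpy as np


def bilinear_sample(grid, uv):
    """Bilinearly interpolate a (R, R, C) feature grid at UV coords in [0, 1].

    uv has shape (N, 2); column 0 indexes grid columns, column 1 grid rows.
    """
    res = grid.shape[0]
    xy = np.clip(uv, 0.0, 1.0) * (res - 1)
    base = np.minimum(np.floor(xy).astype(int), res - 2)  # keep +1 neighbour in bounds
    frac = xy - base
    x0, y0 = base[:, 0], base[:, 1]
    fx, fy = frac[:, :1], frac[:, 1:]
    top = grid[y0, x0] * (1 - fx) + grid[y0, x0 + 1] * fx
    bottom = grid[y0 + 1, x0] * (1 - fx) + grid[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy


def multires_embed(grids, uv):
    """Concatenate the interpolated features from every resolution level."""
    return np.concatenate([bilinear_sample(g, uv) for g in grids], axis=-1)


def cross_attention(query, reference):
    """Single-head scaled dot-product attention; reference is both Key and Value."""
    scores = query @ reference.T / np.sqrt(query.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ reference


def decode_rgb(embed, w1, w2):
    """Shared two-layer MLP (ReLU, then sigmoid) mapping embeddings to RGB in [0, 1]."""
    hidden = np.maximum(embed @ w1, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ w2)))


rng = np.random.default_rng(0)
# Assumed setup: three grid levels with 4 feature channels each.
grids = [rng.normal(size=(r, r, 4)) for r in (4, 8, 16)]
render_uv = rng.uniform(size=(5, 2))   # rasterized UVs of one instance in a view
ref_uv = rng.uniform(size=(32, 2))     # pre-sampled reference UVs of that instance

render_embed = multires_embed(grids, render_uv)    # Query, shape (5, 12)
ref_embed = multires_embed(grids, ref_uv)          # Key and Value, shape (32, 12)
instance_aware = cross_attention(render_embed, ref_embed)
rgb = decode_rgb(instance_aware, rng.normal(size=(12, 16)), rng.normal(size=(16, 3)))
```

In this sketch the instance mask is implicit: both `render_uv` and `ref_uv` are assumed to belong to the same instance, so every rendered pixel of that instance attends to the same set of reference points, which is the mechanism the caption credits for style consistency.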
