CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Weilin Chen, Jiahao Rao, Wenhao Wang, Xinyang Li, Xuan Cheng, Liujuan Cao

Abstract

The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision needed for fine-grained, instance-level control, and often produce textures with low quality, visible artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance of each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention mechanism, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

Paper Structure (20 sections, 5 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: CustomTex is capable of generating high-fidelity texture for a 3D scene mesh, driven by instance-specific reference images.
  • Figure 2: Pipeline of CustomTex. CustomTex textures a complete 3D indoor scene by optimizing a texture map in UV space through a dual-distillation training approach. In each iteration, the 3D scene with the current texture is rendered from a random viewpoint, producing an RGB image, a depth map, and instance masks. The instance masks are used to align each reference image's features with the correct object instance in the rendered RGB image via a specialized cross-attention. The Variational Score Distillation gradient and the Super-Resolution gradient are then computed, conditioned on the aligned reference images, to update the texture field.
  • Figure 3: Qualitative comparison on image-to-texture generation. All generated textures are rendered in 3ds Max at a resolution of $2000\times2000$ for visualization. CustomTex demonstrates instance-level consistency with the reference images, while also exhibiting greater sharpness with fewer shading effects and artifacts than the baselines.
  • Figure 4: Qualitative comparison on close-up texture renderings.
  • Figure 5: Qualitative comparison on text-to-texture generation. The text prompt is: "The Nanyang vintage-style living room equipped with walls featuring dark wood panel textures, a brown leather sofa, a round fabric stool with floral patterns, a TV stand made of dark wood with golden handles, dark brown wooden chairs and a light-color wood coffee table." GPT-4V is used to convert this text prompt into reference image prompts for CustomTex. All generated textures are rendered as $768\times768$ images for visualization. Only CustomTex demonstrates instance-level consistency with the text prompt.
  • ...and 11 more figures
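
The Figure 2 caption says instance masks are used to align each reference image's features with the correct object instance through a specialized cross-attention. The listing does not give the exact formulation, so the following is only a minimal sketch of one plausible reading: each rendered pixel attends solely to reference tokens that carry the same instance ID. The function name, argument names, and the choice to return zeros for pixels with no matching reference are all assumptions made for illustration, not the authors' implementation.

```python
import math


def instance_masked_cross_attention(queries, pixel_ids, ref_tokens, ref_ids):
    """Hypothetical instance-restricted cross-attention (illustrative only).

    queries    : list of pixel feature vectors from the rendered image
    pixel_ids  : instance ID of each pixel, taken from the rendered instance mask
    ref_tokens : list of reference-image feature vectors
    ref_ids    : instance ID that each reference token describes
    Reference tokens double as keys and values to keep the sketch short.
    """
    dim = len(ref_tokens[0])
    outputs = []
    for query, pixel_id in zip(queries, pixel_ids):
        # Keep only the reference tokens that belong to this pixel's instance.
        allowed = [tok for tok, rid in zip(ref_tokens, ref_ids) if rid == pixel_id]
        if not allowed:
            # Assumption: pixels without a reference receive no conditioning.
            outputs.append([0.0] * dim)
            continue
        # Scaled dot-product scores against the allowed tokens.
        scores = [
            sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim)
            for tok in allowed
        ]
        # Numerically stable softmax over the allowed tokens.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the allowed tokens.
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, allowed)) / total
             for j in range(dim)]
        )
    return outputs
```

Under this reading, a pixel whose instance has a single reference token simply receives that token's features, and features from another object's reference can never leak into it, which is the kind of "reference-instance" alignment the abstract describes.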