RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing

Zhipeng Huang, Wangbo Yu, Xinhua Cheng, ChengShu Zhao, Yunyang Ge, Mingyi Guo, Li Yuan, Yonghong Tian

TL;DR

This work addresses indoor scene texture synthesis, where the challenge is to maintain cross-view consistency while remaining computationally efficient. It introduces RoomPainter, a zero-shot diffusion-based framework with a two-stage process: MVIS generates a globally consistent room texture, and MVRS repaints occluded regions at the instance level, with both stages guided by a Related View-based Attention module. The approach is reported to achieve better global and local texture quality than the compared baselines while reducing generation time. Because it produces high-fidelity, view-consistent indoor textures without per-view optimization, RoomPainter is relevant to VR/AR, digital media, and automated scene authoring.

Abstract

Indoor scene texture synthesis has garnered significant interest due to its potential applications in virtual reality, digital media and creative arts. Existing diffusion-model-based methods either rely on per-view inpainting techniques, which are plagued by severe cross-view inconsistencies and conspicuous seams, or adopt optimization-based approaches that involve substantial computational overhead. In this work, we present RoomPainter, a framework that combines efficiency and consistency to achieve high-fidelity texturing of indoor scenes. The core of RoomPainter features a zero-shot technique that adapts a 2D diffusion model for 3D-consistent texture synthesis, along with a two-stage generation strategy that ensures both global and local consistency. Specifically, we introduce Attention-Guided Multi-View Integrated Sampling (MVIS) combined with a neighbor-integrated attention mechanism for zero-shot texture map generation. Using MVIS, we first generate a texture map for the entire room to ensure global consistency, then adopt its variant, namely Attention-Guided Multi-View Integrated Repaint Sampling (MVRS), to repaint individual instances within the room, thereby further enhancing local consistency and addressing the occlusion problem. Experiments demonstrate that RoomPainter achieves superior performance for indoor scene texture synthesis in visual quality, global consistency and generation efficiency.

Paper Structure

This paper contains 31 sections, 7 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of Multi-View Integrated Sampling (MVIS). For $N$ surrounding viewpoints in the room, centered around the room's center point and guided by the corresponding depth maps, we use a diffusion model to generate the denoised observation $\mathcal{I}_t^n$ at timestep $t$. This observation is then projected into UV space to obtain texture maps that correspond to the respective viewpoints. The texture maps from multiple viewpoints are dynamically merged to produce the texture map for the current timestep, which subsequently guides the sampling process for the next timestep.
  • Figure 2: Illustration of Multi-View Integrated Repaint Sampling (MVRS). Due to occlusion between instances, certain areas remain untextured after the first stage of texture generation. To address this issue, we perform texture generation for each instance within the room based on the painted areas. For $N$ different viewpoints of each instance, guided by the corresponding depth maps, at sampling step $t$ of MVRS, we combine the painted areas ($x_{MVIS}$) with the sampling results of MVIS at step $t+1$ ($x_{0,t+1}$) using a mask $P$ to form $x_t$, which serves as the input for the sampling process at timestep $t$. Noise corresponding to the current timestep is added before the mask combination. Upon completing the MVRS process, the texture map for a specific instance in the room is fully generated.
  • Figure 3: Qualitative comparisons. Text2Tex-H [chen2023text2tex] suffers from occlusion and visible seams. Text2Tex-C [chen2023text2tex] struggles to maintain style consistency across all instances. SceneTex [chen2024scenetex] produces unrealistic texture and results in blurry regions. In contrast, our method generates high-quality texture while preserving overall style consistency across instances in the scene. Ceilings and back-facing walls are excluded for improved visualizations.
  • Figure 4: Ablation studies on the multiview-consistency module. Synthesizing texture without the Related View-based attention leads to inconsistencies across views, as shown in the leftmost column. When synthesizing texture without MVIS, noticeable inconsistencies appear across different viewpoints (middle column). In contrast, our full method samples multi-view images with significantly stronger consistency. The areas of major inconsistency in the images are highlighted with red and blue boxes. In the same column, boxes of the same color indicate that they are close to each other in 3D space. Zoom in for the best view.
  • Figure 5: The interface of the questionnaire system used in the user study. We present 4 rendered views from each of 6 different texturing results to each participant and ask them to rate the scenes across three dimensions.
  • ...and 4 more figures
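
The two steps described in the captions of Figures 1 and 2 (merging per-view UV textures, and combining painted and newly sampled regions with a mask) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the per-texel weights, and the use of a plain weighted average for the "dynamic" merge are all assumptions made for illustration. The sketch also assumes that both inputs to the mask combination are noised to the current timestep with the standard DDPM forward process, which is one possible reading of the Figure 2 caption.

```python
import numpy as np


def merge_view_textures(view_textures, view_weights, eps=1e-8):
    """Merge N per-view UV textures into one map by weighted averaging.

    view_textures: (N, H, W, C) textures back-projected from each viewpoint.
    view_weights:  (N, H, W) non-negative weights, 0 where a texel is not
                   visible from that view (hypothetical weighting scheme).
    Returns the merged (H, W, C) texture and an (H, W) coverage mask.
    """
    w = view_weights[..., None]                      # (N, H, W, 1)
    total = w.sum(axis=0)                            # (H, W, 1)
    merged = (w * view_textures).sum(axis=0) / np.maximum(total, eps)
    covered = total[..., 0] > eps                    # texels seen by any view
    return merged, covered


def add_noise(x0, alpha_bar_t, rng):
    """Standard DDPM forward noising of a clean sample to noise level t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise


def repaint_combine(x_mvis, x0_prev, painted_mask, alpha_bar_t, rng):
    """Form x_t from painted areas and the previous step's sampling result.

    x_mvis:       (H, W, C) content already painted in the first stage.
    x0_prev:      (H, W, C) sampling result from step t+1 (x_{0,t+1}).
    painted_mask: (H, W) boolean mask P, True where x_mvis is valid.
    Assumption: both inputs are noised to level t before being combined.
    """
    m = painted_mask[..., None].astype(x_mvis.dtype)
    known = add_noise(x_mvis, alpha_bar_t, rng)
    unknown = add_noise(x0_prev, alpha_bar_t, rng)
    return m * known + (1.0 - m) * unknown
```

In this sketch, texels that no view covers are left at zero and flagged in the coverage mask; in the two-stage setting described above, such regions are the ones the repaint stage would be expected to fill.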