ReStyle3D: Scene-Level Appearance Transfer with Semantic Correspondences

Liyuan Zhu, Shengqu Cai, Shengyu Huang, Gordon Wetzstein, Naji Khosravan, Iro Armeni

TL;DR

ReStyle3D addresses scene-level appearance transfer by leveraging explicit semantic correspondences via open-vocabulary panoptic segmentation and embedding them into a training-free semantic attention mechanism within a pretrained diffusion model. It then lifts the stylization to additional views using a warp-and-refine diffusion network guided by monocular depth and pixel correspondences, implemented in an auto-regressive framework to maintain 3D-consistent results. The approach delivers improved structure preservation, perceptual style fidelity, and multi-view coherence over prior 2D and 3D editing methods, with supporting quantitative and user studies. The framework enables practical interior-design and virtual-staging applications without requiring explicit 3D geometry or camera poses, and is designed to be plug-and-play with existing diffusion-based pipelines.

Abstract

We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style and real-world images. This ensures that each object is stylized with semantically matched textures. It first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to additional views via a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view coherence. User studies further validate its ability to produce photo-realistic, semantically faithful results. Our code, pretrained models, and dataset will be publicly released, to support new applications in interior design, virtual staging, and 3D-consistent stylization.
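The semantic-attention idea described above can be illustrated with a minimal sketch: each query token from the output latent attends only to style tokens that carry the same semantic label. This is a simplified, pure-Python stand-in and not the authors' implementation. The function name `semantic_attention`, the per-token label lists, and the fallback to unmasked attention when no label matches are all assumptions made for illustration.

```python
import math


def semantic_attention(queries, keys, values, query_labels, key_labels):
    """Scaled dot-product attention restricted by semantic labels.

    queries: list of d-dim vectors (tokens of the output latent)
    keys, values: lists of vectors (tokens of the style latent)
    query_labels, key_labels: one label per token; assumed to come from
    a segmentation step that is not shown here (hypothetical input).
    """
    d = len(queries[0])
    outputs = []
    for q, q_label in zip(queries, query_labels):
        # Keep only style tokens whose label matches the query's label.
        allowed = [j for j, k_label in enumerate(key_labels) if k_label == q_label]
        if not allowed:
            # Assumption: with no semantic match, fall back to all style tokens.
            allowed = list(range(len(keys)))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) for j in allowed
        ]
        # Numerically stable softmax over the allowed tokens only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[j][c] for w, j in zip(weights, allowed))
            for c in range(len(values[0]))
        ])
    return outputs


# Toy example: two query tokens, three style tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
out = semantic_attention(
    queries, keys, values,
    query_labels=["rug", "table"],
    key_labels=["rug", "table", "table"],
)
```

In the toy example the "rug" query can only see the single "rug" style token, while the "table" query averages over the two "table" tokens and never receives values from the rug.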

Paper Structure

This paper contains 29 sections, 5 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Semantic Appearance Transfer. The style and source images are first noised back to step $T$ using DDPM inversion huberman2024edit. During the generation of the stylized output, the extended self-attention layer transfers style information from the style latent to the output latent. This process is further guided by a semantic matching mask, which allows for precise control.
  • Figure 2: Attention Query Visualization. We visualize the attention score at two query positions, coffee table and rug. Raw attention in alaluf2024cross spills across regions (red arrows) due to multi-instance ambiguity, whereas semantic attention confines the activation to the matched region.
  • Figure 3: Multi-view Inconsistency Caused by Separate Transfer. When stylizing each view separately, we observe inconsistencies in the results (highlighted by red arrows) due to high variance in generative modeling.
  • Figure 4: Multi-view Style Lifting. Stereo correspondences are extracted from the original image pair $(\mathbf{I}_{src}^i, \mathbf{I}_{src}^j)$ and used to warp the stylized image $\hat{\mathbf{I}}^i$ to the second view, producing $\mathbf{I}^j_w$. To address missing pixels from warping, we train a warp-and-refine model to complete the stylized image $\hat{\mathbf{I}}^j$. This model is applied across multiple views within our auto-regressive framework.
  • Figure 5: Image Appearance Transfer Results. Our method enables precise appearance transfer between semantically corresponding elements, evidenced by the green rug and glass table (first row), textured cabinet (second row), and bedsheets (third row). Unlike baselines that either apply global style transfer or fail to preserve structure, ReStyle3D maintains both semantic fidelity and structural integrity.
  • ...and 5 more figures
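
The warp-and-refine step from Figure 4 can be sketched as follows: stylized pixels from view $i$ are forward-warped to view $j$ through pixel correspondences, and the positions that receive no pixel form a hole mask that a refinement model would complete. This is a hedged illustration rather than the paper's pipeline. The names `warp_with_correspondences` and `lift_autoregressively` are hypothetical, `refine` is a placeholder for the learned diffusion model, and conditioning each view only on the immediately preceding one is an assumption.

```python
def warp_with_correspondences(stylized, matches, height, width):
    """Forward-warp stylized pixels from view i into view j.

    stylized: 2D list (rows of pixel values) for the stylized view i
    matches: list of ((yi, xi), (yj, xj)) pixel correspondences
    Returns (warped, hole_mask); hole_mask[y][x] is True where no
    correspondence landed and content still has to be filled in.
    """
    warped = [[None] * width for _ in range(height)]
    for (yi, xi), (yj, xj) in matches:
        # Drop correspondences that fall outside the target image.
        if 0 <= yj < height and 0 <= xj < width:
            warped[yj][xj] = stylized[yi][xi]
    hole_mask = [[pixel is None for pixel in row] for row in warped]
    return warped, hole_mask


def lift_autoregressively(first_stylized, pairwise_matches, height, width, refine):
    """Stylize a view sequence one view at a time.

    refine: hypothetical callable (warped, hole_mask) -> completed image,
    standing in for the learned warp-and-refine model.
    """
    results = [first_stylized]
    for matches in pairwise_matches:
        warped, holes = warp_with_correspondences(results[-1], matches, height, width)
        results.append(refine(warped, holes))
    return results


# Toy example: a 2x2 "image" shifted one pixel to the right.
stylized_i = [[1, 2], [3, 4]]
matches = [((0, 0), (0, 1)), ((1, 0), (1, 1)), ((0, 1), (0, 2))]
warped, holes = warp_with_correspondences(stylized_i, matches, 2, 2)


def fill_zero(warped, hole_mask):
    # Trivial placeholder "refiner": fills holes with 0.
    return [
        [0 if is_hole else pixel for pixel, is_hole in zip(w_row, h_row)]
        for w_row, h_row in zip(warped, hole_mask)
    ]


views = lift_autoregressively(stylized_i, [matches], 2, 2, fill_zero)
```

In the toy example the left column of the target view receives no pixel, so it is marked as a hole and handed to the placeholder refiner, while the correspondence that lands outside the image is discarded.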