ReStyle3D: Scene-Level Appearance Transfer with Semantic Correspondences
Liyuan Zhu, Shengqu Cai, Shengyu Huang, Gordon Wetzstein, Naji Khosravan, Iro Armeni
TL;DR
ReStyle3D addresses scene-level appearance transfer by establishing explicit semantic correspondences with open-vocabulary panoptic segmentation and embedding them in a training-free semantic-attention mechanism within a pretrained diffusion model. It then lifts the stylization to additional views using a warp-and-refine diffusion network guided by monocular depth and pixel correspondences, applied auto-regressively to keep the results 3D-consistent. The approach improves structure preservation, perceptual style fidelity, and multi-view coherence over prior 2D and 3D editing methods, as supported by quantitative evaluations and user studies. The framework enables practical interior-design and virtual-staging applications without requiring explicit 3D geometry or camera poses, and is designed to be plug-and-play with existing diffusion-based pipelines.
Abstract
We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style image and the real-world images. This ensures that each object is stylized with semantically matched textures. ReStyle3D first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to additional views via a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view coherence. User studies further validate its ability to produce photo-realistic, semantically faithful results. Our code, pretrained models, and dataset will be publicly released to support new applications in interior design, virtual staging, and 3D-consistent stylization.
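To make the idea of semantically guided attention concrete, the following is a minimal NumPy sketch of one way a segmentation-derived mask could restrict attention between scene tokens and style tokens. It is an illustration under stated assumptions, not the ReStyle3D implementation: the function name `semantic_masked_attention`, the use of integer per-token labels, and the fallback to unrestricted attention for scene tokens with no matching style label are all hypothetical choices made for this example.

```python
import numpy as np

def semantic_masked_attention(q, k, v, q_labels, k_labels):
    """Attention from scene tokens (q) to style tokens (k, v), restricted
    to pairs that share a semantic label.

    q: (Nq, d) scene queries, k: (Nk, d) style keys, v: (Nk, dv) style values.
    q_labels: (Nq,) and k_labels: (Nk,) integer labels (assumed to come
    from a segmentation of each image; hypothetical input format).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (Nq, Nk) scaled dot products

    # True where the scene token and the style token share a label.
    match = q_labels[:, None] == k_labels[None, :]     # (Nq, Nk)

    # Assumption: a scene token whose label is absent from the style image
    # falls back to attending over all style tokens.
    has_match = match.any(axis=1, keepdims=True)       # (Nq, 1)
    allowed = np.where(has_match, match, True)

    # Disallowed pairs get -inf so they receive zero weight after softmax.
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

# Toy usage: 4 scene tokens, 6 style tokens; label 2 does not occur in the style image.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 3))
q_labels = np.array([0, 0, 1, 2])
k_labels = np.array([0, 1, 1, 0, 1, 0])

out, w = semantic_masked_attention(q, k, v, q_labels, k_labels)
```

In this toy example, the first scene token (label 0) places weight only on style tokens labeled 0, while the last scene token (label 2, unmatched) spreads weight over all style tokens. Because the mask only modifies attention scores and adds no learned parameters, a mechanism of this kind can in principle be applied to a pretrained model without retraining, which is consistent with the training-free setting described above.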
