Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control

Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada

TL;DR

This paper introduces techniques that enhance the quality of 3D stylization while maintaining view consistency and providing optional region-controlled style transfer, and proposes Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss, allowing styles to be applied to distinct image regions using segmentation masks from off-the-shelf models.

Abstract

Recent advances in text-driven 3D scene editing and stylization, which leverage the powerful capabilities of 2D generative models, have demonstrated promising outcomes. However, challenges remain in ensuring high-quality stylization and view consistency simultaneously. Moreover, applying style consistently to different regions or objects in the scene with semantic correspondence is a challenging task. To address these limitations, we introduce techniques that enhance the quality of 3D stylization while maintaining view consistency and providing optional region-controlled style transfer. Our method achieves stylization by re-training an initial 3D representation using stylized multi-view 2D images of the source views. Therefore, ensuring both style consistency and view consistency of stylized multi-view images is crucial. We achieve this by extending the style-aligned depth-conditioned view generation framework, replacing the fully shared attention mechanism with a single reference-based attention-sharing mechanism, which effectively aligns style across different viewpoints. Additionally, inspired by recent 3D inpainting methods, we utilize a grid of multiple depth maps as a single-image reference to further strengthen view consistency among stylized images. Finally, we propose Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss, allowing styles to be applied to distinct image regions using segmentation masks from off-the-shelf models. We demonstrate that this optional feature enhances the faithfulness of style transfer and enables the mixing of different styles across distinct regions of the scene. Experimental evaluations, both qualitative and quantitative, demonstrate that our pipeline effectively improves the results of text-driven 3D stylization. Project Page: https://haruolabs.github.io/improved-gs-style-page/
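The difference between fully shared attention and single reference-based attention sharing can be illustrated with a small NumPy sketch. It is a toy illustration, not the paper's implementation: it assumes that each view's queries attend to that view's own keys and values concatenated with those of one reference view, and the function name `reference_shared_attention`, the `ref_index` parameter, and the `(views, tokens, dim)` layout are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_shared_attention(q, k, v, ref_index=0):
    """Toy single-reference attention sharing (assumed formulation).

    q, k, v: arrays of shape (views, tokens, dim).
    Each view attends to its own tokens plus the tokens of the
    reference view, never to the other target views.
    """
    dim = q.shape[-1]
    # Repeat the reference keys/values once per view.
    k_ref = np.broadcast_to(k[ref_index], k.shape)
    v_ref = np.broadcast_to(v[ref_index], v.shape)
    # Concatenate along the token axis: (views, 2 * tokens, dim).
    k_cat = np.concatenate([k, k_ref], axis=1)
    v_cat = np.concatenate([v, v_ref], axis=1)
    scores = q @ k_cat.transpose(0, 2, 1) / np.sqrt(dim)
    return softmax(scores) @ v_cat
```

Under this assumption, changing the keys of one target view leaves the outputs of every other view untouched, whereas in a fully shared scheme every view would be affected by every other view.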

Paper Structure

This paper contains 32 sections, 22 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overall Multi-View Stylization Pipeline. (1) We generate stylized multi-view images of the source scene using our custom diffusion pipeline anchored on tiled depth maps. (2) Next, the source 3D scene is fine-tuned on the generated images. This refinement stage may take optional region masks for spatial control of the style transfer process (see the section on Multi-Region Importance-Weighted SWD).
  • Figure 2: Comparison of loss convergence between vanilla SWD and our proposed IW-SWD. We plot the SWD loss values over 1,500 training iterations for both vanilla SWD and our importance-weighted SWD (IW-SWD), which uses only 5% of the projection samples. Despite the reduced number of projections, IW-SWD achieves comparable convergence behavior, demonstrating its efficiency in guiding 3D stylization with significantly fewer computations.
  • Figure 3: Image-to-Image Multi-View Generation Pipeline. We first obtain depth maps for both tiled representative views and the target views using an off-the-shelf depth prediction model. The tiled depth maps are then provided as conditioning input to a depth-guided ControlNet attached to a Stable Diffusion XL (SDXL) model. To ensure consistent appearance across viewpoints, the diffusion model incorporates an attention-sharing mechanism anchored on the reference tiled depth map, enabling coherent stylization of the target multi-view images.
  • Figure 4: Method Comparison. We compare our method against Style-NeRF2NeRF [fujiwara2024sn2n] and DGE [chen2024dge]. As shown, our approach produces clearer and more visually artistic results that are faithful to the given style prompts, while exhibiting fewer artifacts. For a fair comparison, we replace the underlying NeRF representation in the Style-NeRF2NeRF baseline with the same 2D Gaussian Splatting (2DGS) [huang20242d] used in our method.
  • Figure 5: Ablation results for 3D style transfer using the prompt "A painting of a blue bear." Without our multi-region loss, stylization exhibits color bleeding, with blue regions spilling outside the bear. Additionally, our multi-view generation pipeline produces sharper results with fewer artifacts, demonstrating improved consistency and fidelity in the stylized 3D scene.
  • ...and 4 more figures
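
The captions for Figures 2 and 5 refer to a sliced Wasserstein distance (SWD) loss applied per region. The following is a minimal NumPy sketch of that general idea, under stated assumptions: features are projected onto random unit directions and the resulting 1D distributions are compared through their quantiles, once per pair of masks. The function names, the quantile-based comparison, and the plain averaging over regions are illustrative choices, not the paper's definition, and the importance weighting of projections is not reproduced here because this summary does not specify it.

```python
import numpy as np


def sliced_wasserstein(x, y, n_proj=64, n_quantiles=100, rng=None):
    """Approximate squared SWD between feature sets x (N, C) and y (M, C)."""
    rng = np.random.default_rng(rng)
    # Random unit directions, one per column.
    dirs = rng.normal(size=(x.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # Compare the 1D projected distributions through their quantiles,
    # which also handles N != M.
    levels = np.linspace(0.0, 1.0, n_quantiles)
    qx = np.quantile(x @ dirs, levels, axis=0)
    qy = np.quantile(y @ dirs, levels, axis=0)
    return float(np.mean((qx - qy) ** 2))


def multi_region_swd(feat, style_feat, masks, style_masks, **kwargs):
    """Average the SWD over corresponding regions (hypothetical helper).

    feat, style_feat: feature maps of shape (H, W, C).
    masks, style_masks: lists of boolean (H, W) arrays, paired by index.
    """
    total = 0.0
    for mask, style_mask in zip(masks, style_masks):
        total += sliced_wasserstein(feat[mask], style_feat[style_mask], **kwargs)
    return total / len(masks)
```

Because each region is matched only against its paired style region, statistics from one region (for example, the blue of the bear in Figure 5) do not contribute to the loss of another, which is one way to read the reduced color bleeding described in the caption.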