ArtNVG: Content-Style Separated Artistic Neighboring-View Gaussian Stylization

Zixiao Gu, Mengtian Li, Ruhua Chen, Zhongxia Ji, Sichen Guo, Zhenye Zhang, Guangnan Ye, Zuo Hu

TL;DR

ArtNVG addresses the challenge of stylizing 3D Gaussian Splatting scenes with a target style image while preserving content and ensuring local color/texture coherence. It introduces Content-Style Separated Control to decouple content and style influences and employs Attention-based Neighboring-View Alignment to enforce cross-view consistency during diffusion-driven stylization. The framework leverages 3DGS, CSGO-style/content projections, Tile ControlNet, and a Neighboring-View diffusion model, achieving fast optimization and high reconstruction quality. Empirical results on Tanks and Temples with WikiArt styles show superior content fidelity, style alignment, and multi-view consistency compared to StyleGaussian and InstantStyleGaussian, with a total stylization time around 20 minutes. This approach enables robust, scalable 3D style transfer suitable for production pipelines in film, gaming, and immersive media.

Abstract

As demand from the film and gaming industries for 3D scenes with target styles grows, the importance of advanced 3D stylization techniques increases. However, recent methods often struggle to maintain local consistency in color and texture throughout stylized scenes, which is essential for maintaining aesthetic coherence. To solve this problem, this paper introduces ArtNVG, an innovative 3D stylization framework that efficiently generates stylized 3D scenes by leveraging reference style images. Built on 3D Gaussian Splatting (3DGS), ArtNVG achieves rapid optimization and rendering while upholding high reconstruction quality. Our framework realizes high-quality 3D stylization by incorporating two pivotal techniques: Content-Style Separated Control and Attention-based Neighboring-View Alignment. Content-Style Separated Control uses the CSGO model and the Tile ControlNet to decouple the content and style control, reducing risks of information leakage. Concurrently, Attention-based Neighboring-View Alignment ensures consistency of local colors and textures across neighboring views, significantly improving visual quality. Extensive experiments validate that ArtNVG surpasses existing methods, delivering superior results in content preservation, style alignment, and local consistency.
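To make the idea of aligning neighboring views through attention more concrete, here is a minimal NumPy sketch. It assumes a common cross-view scheme in which each view's queries attend to keys and values pooled from every view in its group; the abstract does not specify the layer's actual design, so the function name `neighboring_view_attention`, the tensor layout `(V, N, D)`, and the pooling strategy are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def neighboring_view_attention(q, k, v):
    """Hypothetical cross-view attention over one group of neighboring views.

    q, k, v: arrays of shape (V, N, D) for V views, N tokens per view,
    and feature dimension D. Every view's queries attend to the keys and
    values of all V views, so features can be shared across the group.
    """
    V, N, D = q.shape
    # Pool keys/values from all views into one shared sequence: (1, V*N, D).
    k_all = k.reshape(1, V * N, D)
    v_all = v.reshape(1, V * N, D)
    # Scaled dot-product scores, broadcast over the view axis: (V, N, V*N).
    scores = q @ k_all.transpose(0, 2, 1) / np.sqrt(D)
    weights = softmax(scores, axis=-1)
    # Weighted sum of the pooled values: (V, N, D).
    return weights @ v_all


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(3, 4, 8)) for _ in range(3))
out = neighboring_view_attention(q, k, v)
print(out.shape)
```

Because the keys and values are shared, tokens depicting the same surface region in different views draw on the same pool of features, which is one plausible way local colors and textures could be kept consistent across a group.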

Paper Structure

This paper contains 23 sections, 13 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: ArtNVG. Our method applies the style of a reference image to a 3D Gaussian Splatting (3DGS) scene (Left). This is realized by stylizing the renderings of 3DGS and finetuning the original scene (Right). Our contribution is a content-style separated, neighboring-view aligned stylization framework, which primarily reduces the risk of information leakage and improves the consistency of local colors and textures.
  • Figure 2: (a) Left: Overview of ArtNVG. (b) Right: Architecture of the Neighboring-View Attention Layer. Given an original 3DGS scene $\mathcal{G}$, we first render the content images $I_c$ from it. Then, we encode the content images $I_c$ and style image $I_\mathrm{sty}$ separately to get the content and style control. Meanwhile, neighboring views of content images are clustered into groups and sampled together in the Neighboring-View Diffusion Model. Finally, stylized images are used to finetune $\mathcal{G}$ and get the stylized 3DGS scene $\mathcal{G}_s$.
  • Figure 3: Comparative Analysis of Detail Consistency in Neighboring Views with and without Neighboring-View Attention. This illustration presents a comparative visualization of stylization results between two neighboring views. The regions demarcated by red bounding boxes clearly exhibit enhanced detail consistency when NV Attention is implemented, in contrast to the results obtained without NV Attention.
  • Figure 4: Qualitative Results of Comparison. We show diverse results of stylization in various scenes with different styles. Compared to SOTA methods StyleGaussian and InstantStyleGaussian, our method achieves the best visual quality by generating locally consistent scenes with better style alignment while preserving high content fidelity.
  • Figure 5: User Study. We record user preferences for our method and the baselines. Our method is preferred more often than the baselines in content fidelity, style alignment, and visual quality.
  • ...and 5 more figures
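
The Figure 2 caption mentions that neighboring views are clustered into groups before being sampled together. The caption does not say how the grouping is done, so the following is only a sketch under stated assumptions: it groups views greedily by Euclidean distance between camera centers, and both the function name `group_neighboring_views` and the `group_size` parameter are hypothetical.

```python
import numpy as np


def group_neighboring_views(cam_positions, group_size):
    """Greedily group views whose camera centers are close together.

    Assumption: spatial proximity of camera centers is a usable proxy for
    "neighboring views". The paper's actual clustering may differ.
    Returns a list of groups, each a list of view indices.
    """
    positions = np.asarray(cam_positions, dtype=float)
    unassigned = list(range(len(positions)))
    groups = []
    while unassigned:
        seed = unassigned[0]
        # Distance from the seed view to every still-unassigned view.
        dists = np.linalg.norm(positions[unassigned] - positions[seed], axis=1)
        # Take the seed plus its nearest unassigned neighbors.
        nearest = np.argsort(dists, kind="stable")[:group_size]
        group = [unassigned[i] for i in nearest]
        groups.append(group)
        unassigned = [i for i in unassigned if i not in group]
    return groups


cams = [[0, 0, 0], [1, 0, 0], [10, 0, 0], [11, 0, 0]]
print(group_neighboring_views(cams, group_size=2))  # → [[0, 1], [2, 3]]
```

Each resulting group could then be passed through the diffusion model as one batch, so that an attention layer operating across the batch sees only views that overlap substantially.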