Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles

Peng Wang, Xiang Liu, Peidong Liu

TL;DR

Styl3R addresses fast, multi-view consistent 3D stylization from sparse unposed views by introducing a dual-branch network that decouples structure and appearance. The structure branch reconstructs 3D geometry using a dense prior, while the appearance branch stylizes color through cross attention with a style image. A two-stage training curriculum (novel view synthesis pre-training followed by stylization fine-tuning), combined with an identity loss, preserves geometry and enables zero-shot stylization. Across in-domain and out-of-domain datasets, Styl3R achieves state-of-the-art zero-shot stylization with substantially faster inference, enabling interactive applications, though it currently supports only static scenes.

Abstract

Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically involve computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, often requiring dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach to achieve direct 3D stylization in less than a second using unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling and appearance shading, effectively preventing stylistic transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations, using both in-domain and out-of-domain datasets, demonstrate that our approach produces high-quality stylized 3D content that achieves a superior blend of style and scene appearance, while also outperforming existing methods in terms of multi-view consistency and efficiency.
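The branched design described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the token counts, the feature width `d`, the single-head attention, the residual connection, and the choice of content tokens as queries are all assumptions made for illustration, and `structure_head` and `cross_attention` are hypothetical names. The point it shows is only that structure attributes depend on content tokens alone, while color features are produced by blending content tokens with style tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def structure_head(content_tokens, w_s):
    # Hypothetical structure branch: uses content tokens only,
    # so the style image cannot alter the predicted geometry.
    return content_tokens @ w_s

def cross_attention(content_tokens, style_tokens, w_q, w_k, w_v):
    # Hypothetical appearance branch: content tokens act as queries,
    # style tokens as keys and values (an assumption, see lead-in).
    q = content_tokens @ w_q                         # (N_c, d)
    k = style_tokens @ w_k                           # (N_s, d)
    v = style_tokens @ w_v                           # (N_s, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N_c, N_s)
    blended = content_tokens + attn @ v              # residual blend
    return blended, attn

rng = np.random.default_rng(0)
d = 16                                     # assumed feature width
content = rng.normal(size=(2 * 64, d))     # tokens from 2 views, 64 each (assumed)
style_a = rng.normal(size=(32, d))         # tokens of one style image (assumed)
style_b = rng.normal(size=(32, d))         # tokens of a different style image
w_q, w_k, w_v, w_s = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

structure = structure_head(content, w_s)
blended_a, attn_a = cross_attention(content, style_a, w_q, w_k, w_v)
blended_b, attn_b = cross_attention(content, style_b, w_q, w_k, w_v)
```

In this sketch, swapping `style_a` for `style_b` changes the blended color features but leaves `structure` untouched, which mirrors the stated goal of keeping stylization from distorting scene geometry.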

Paper Structure

This paper contains 41 sections, 2 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Styl3R. Given unposed sparse-view images and an arbitrary style image, our method predicts stylized 3D Gaussians in less than a second using a feed-forward network.
  • Figure 2: Overview of Styl3R. Our model consists of a structure branch and an appearance branch that output different attributes of the Gaussians. In the structure branch, sparse unposed images are encoded by a shared content encoder; the content tokens of each image are then fed separately into their structure decoders, with information shared across views. Attributes that govern the structure of the scene are then regressed by the structure heads. In the appearance branch, a style image is encoded by the style encoder, and the resulting style tokens perform cross attention with content tokens from all viewpoints in the stylization decoder. Finally, the color of the Gaussians is predicted from the blended tokens output by this decoder; together with the output of the structure branch, these make up the full set of Gaussian parameters. Besides a style image, the appearance branch can also accept a content image, which gives the Gaussians their original colors.
  • Figure 2: Quantitative Results. Performance comparison of Styl3R with 2D and 3D baselines on RE10K in terms of view consistency. Stylization time refers to processing time excluding IO time.
  • Figure 3: Novel View Transfer Comparison on RE10K. Despite limited image overlap, our method generates stylized novel views that more faithfully capture style details while preserving the original scene structure. In comparison, StyleRF [liu2023stylerf] and StyleGaussian [liu2024stylegaussian] tend to produce over-smoothed results that deviate from the true color tone of the reference style. ARF [zhang2022arf] suffers from style overflow, leading to significant loss of content appearance. As a 2D baseline, StyTr2 [deng2022stytr2] operates directly on ground-truth novel views, but fails to retain fine structural details of the scene.
  • Figure 4: Cross-dataset generalization on the Tanks and Temples dataset. Our model achieves superior or comparable zero-shot style transfer on out-of-distribution data, outperforming style-free baselines such as StyleRF [liu2023stylerf] and StyleGaussian [liu2024stylegaussian] that require per-scene optimization, and matching the performance of ARF [zhang2022arf], which further demands per-scene and per-style optimization.
  • ...and 10 more figures