Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting

Hanzhou Liu, Jia Huang, Mi Lu, Srikanth Saripalli, Peng Jiang

TL;DR

Stylos addresses 3D style transfer from unposed multi-view content by predicting a stylized 3D Gaussian scene $G=\{(p_m,c_m)\}_{m=1}^M$ and per-view cameras in a single forward pass. It uses a Transformer backbone with a Style Aggregator that applies global cross-attention to condition color embeddings on a single style image, while geometry remains backbone-driven, and it introduces a voxel-space 3D style loss that aligns multi-view features with style statistics. The method generalizes zero-shot to unseen categories, scenes, and styles, and outperforms the state-of-the-art baselines StyleRF and StyleGS in cross-view consistency and stylistic quality on Tanks and Temples. It enables scalable, geometry-aware 3D stylization without per-scene optimization, and supports reproducibility through architecture details, public datasets, and pseudocode for the losses. Key contributions are the shared-backbone dual-path design, the voxel-space 3D style loss, and the single-forward Stylos pipeline from unposed inputs.
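To make the style-conditioning idea concrete, the following is a minimal NumPy sketch of cross-attention in which tokens from all views act as queries against one shared set of style tokens. It is an illustration under stated assumptions, not the Stylos Style Aggregator itself: the function name, the single-head formulation, the residual connection, and the projection matrices `w_q`, `w_k`, `w_v` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_style_cross_attention(content_tokens, style_tokens, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention from all views to one style image.

    content_tokens: (V, N, D) tokens for V views with N tokens each.
    style_tokens:   (S, D) tokens of the single style image.
    w_q, w_k:       (D, Dk) projections; w_v: (D, D) so the residual add is valid.
    Returns the conditioned tokens (V, N, D) and the attention map (V*N, S).
    """
    v, n, d = content_tokens.shape
    flat = content_tokens.reshape(v * n, d)  # every view shares one query set
    q = flat @ w_q
    k = style_tokens @ w_k
    val = style_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = flat + attn @ val  # residual connection (an assumption of this sketch)
    return out.reshape(v, n, d), attn


rng = np.random.default_rng(0)
content = rng.normal(size=(3, 4, 8))  # 3 views, 4 tokens per view, 8 channels
style = rng.normal(size=(5, 8))       # 5 style tokens
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
stylized, attn = global_style_cross_attention(content, style, w_q, w_k, w_v)
print(stylized.shape, attn.shape)
```

Because every view queries the same style keys and values, the conditioning signal is identical across views, which is one plausible reading of why global cross-attention helps cross-view consistency.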

Abstract

We present Stylos, a single-forward 3D Gaussian framework for 3D style transfer that operates on unposed content, from a single image to a multi-view collection, conditioned on a separate reference style image. Stylos synthesizes a stylized 3D Gaussian scene without per-scene optimization or precomputed poses, achieving geometry-aware, view-consistent stylization that generalizes to unseen categories, scenes, and styles. At its core, Stylos adopts a Transformer backbone with two pathways: geometry predictions retain self-attention to preserve geometric fidelity, while style is injected via global cross-attention to enforce visual consistency across views. With the addition of a voxel-based 3D style loss that aligns aggregated scene features to style statistics, Stylos enforces view-consistent stylization while preserving geometry. Experiments across multiple datasets demonstrate that Stylos delivers high-quality zero-shot stylization, highlighting the effectiveness of global style-content coupling, the proposed 3D style loss, and the scalability of our framework from single view to large-scale multi-view settings.
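As a rough illustration of a voxel-based style loss, the sketch below pools per-point features into voxels and then matches the channel-wise mean and standard deviation of the pooled features to those of the style features. This is a sketch under assumptions, not the loss defined in the paper: the mean/std matching (in the spirit of AdaIN-style statistics), the averaging inside each voxel, the `voxel_size` value, and all function names are hypothetical choices made for the example.

```python
import numpy as np


def voxelize_features(points, feats, voxel_size):
    """Average the features of all points that fall into the same voxel.

    points: (P, 3) 3D positions; feats: (P, C) per-point features.
    Returns an array of shape (num_occupied_voxels, C).
    """
    idx = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against NumPy-version shape differences
    n_vox = int(inverse.max()) + 1
    sums = np.zeros((n_vox, feats.shape[1]))
    np.add.at(sums, inverse, feats)  # unbuffered scatter-add per voxel
    counts = np.bincount(inverse, minlength=n_vox)[:, None]
    return sums / counts


def mean_std(x, eps=1e-5):
    # Channel-wise statistics over the first (token or voxel) axis.
    return x.mean(axis=0), np.sqrt(x.var(axis=0) + eps)


def voxel_style_loss(points, feats, style_feats, voxel_size=0.1):
    """Hypothetical loss: squared distance between voxel and style statistics."""
    vox = voxelize_features(points, feats, voxel_size)
    mu_c, sd_c = mean_std(vox)
    mu_s, sd_s = mean_std(style_feats)
    return float(np.mean((mu_c - mu_s) ** 2) + np.mean((sd_c - sd_s) ** 2))


rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(200, 3))  # points gathered from several views
f = rng.normal(size=(200, 16))              # per-point features
s = rng.normal(size=(64, 16))               # features of the 2D style image
print(voxel_style_loss(pts, f, s))
```

Pooling in voxel space before computing statistics means that a surface seen from many views contributes once per occupied voxel rather than once per pixel, which is the intuition behind computing the loss on aggregated scene features instead of per image.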

Paper Structure

This paper contains 21 sections, 7 equations, 16 figures, 4 tables, 3 algorithms.

Figures (16)

  • Figure 1: Architecture overview. Given multi-view content inputs and a style reference, Stylos enables instant 3D stylization without scene-specific training or post-optimization. A key component is the 3D style loss, matching voxelized 3D features with 2D style statistics.
  • Figure 2: CO3D pizza scene comparing different style–content cross-attention strategies.
  • Figure 3: Comparison of style losses on unseen donut, skateboard, and pizza scenes from the CO3D dataset. Both scene and 3D style losses yield cleaner stylized textures compared to image-level matching, while the 3D loss further conveys a stronger sense of 3D geometry.
  • Figure 4: Effect of varying the number of views per batch on the Lighthouse scene from Tanks and Temples.
  • Figure 5: Visual comparison between Stylos and recent per-scene 3D stylization baselines.
  • ...and 11 more figures