Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting

Hanzhou Liu; Jia Huang; Mi Lu; Srikanth Saripalli; Peng Jiang

TL;DR

Stylos performs 3D style transfer from unposed multi-view content by predicting a stylized 3D Gaussian scene $G=\{(p_m,c_m)\}_{m=1}^M$ and per-view cameras in a single forward pass. A Transformer backbone is paired with a Style Aggregator that uses global cross-attention to condition color embeddings on a single style image, while geometry remains backbone-driven. A voxel-space 3D style loss aligns multi-view features with style statistics. The method generalizes zero-shot to unseen categories, scenes, and styles, and is reported to outperform the state-of-the-art methods StyleRF and StyleGS in cross-view consistency and stylistic quality on Tanks and Temples. It enables scalable, geometry-aware 3D stylization without per-scene optimization, and supports reproducibility through architecture details, public datasets, and loss pseudocode. Key contributions are the shared-backbone dual-path design, the voxel-space 3D style loss, and the single-forward Stylos pipeline from unposed inputs.
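To make the idea of global cross-attention concrete, the following is a minimal NumPy sketch, not the Stylos implementation. It assumes that "global" means the tokens of all views are flattened into one query set that attends to the tokens of the single style image. The function name `global_style_cross_attention`, the projection matrices `Wq`, `Wk`, `Wv`, the single attention head, and the residual connection are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_style_cross_attention(content_tokens, style_tokens, Wq, Wk, Wv):
    """Hypothetical single-head cross-attention from content to style.

    content_tokens: (V, N, D) tokens from V views, N tokens per view.
    style_tokens:   (S, D) tokens from the one style image.
    Wq, Wk, Wv:     (D, D) projection matrices (assumed shapes).
    """
    V, N, D = content_tokens.shape
    # Flatten views so every token of every view queries the same style tokens.
    q = content_tokens.reshape(V * N, D) @ Wq
    k = style_tokens @ Wk
    v = style_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))   # (V*N, S), each row sums to 1
    out = attn @ v                         # style information per content token
    # Residual connection (an assumption) keeps the content signal.
    return content_tokens + out.reshape(V, N, D), attn

rng = np.random.default_rng(0)
content = rng.normal(size=(3, 4, 8))       # 3 views, 4 tokens each, width 8
style = rng.normal(size=(5, 8))            # 5 style tokens
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
stylized, attn = global_style_cross_attention(content, style, Wq, Wk, Wv)
```

Because every view draws on the same keys and values, each token is conditioned on identical style evidence, which is one plausible way such a design could encourage cross-view consistency.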

Abstract

We present Stylos, a single-forward 3D Gaussian framework for 3D style transfer that operates on unposed content, from a single image to a multi-view collection, conditioned on a separate reference style image. Stylos synthesizes a stylized 3D Gaussian scene without per-scene optimization or precomputed poses, achieving geometry-aware, view-consistent stylization that generalizes to unseen categories, scenes, and styles. At its core, Stylos adopts a Transformer backbone with two pathways: geometry predictions retain self-attention to preserve geometric fidelity, while style is injected via global cross-attention to enforce visual consistency across views. With the addition of a voxel-based 3D style loss that aligns aggregated scene features to style statistics, Stylos enforces view-consistent stylization while preserving geometry. Experiments across multiple datasets demonstrate that Stylos delivers high-quality zero-shot stylization, highlighting the effectiveness of global style-content coupling, the proposed 3D style loss, and the scalability of our framework from single view to large-scale multi-view settings.
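The abstract states only that the voxel-based 3D style loss aligns aggregated scene features to style statistics. The sketch below shows one way such a loss could be built, under loud assumptions: per-point features are averaged inside each occupied voxel, and the loss matches channel-wise mean and standard deviation (AdaIN-style statistics) between the pooled scene features and the style features. The helper names `voxel_pool` and `mean_std_style_loss`, the choice of statistics, and the voxel size are hypothetical and not taken from the paper.

```python
import numpy as np

def voxel_pool(points, feats, voxel_size):
    """Average the features of all points that fall into the same voxel.

    points: (N, 3) 3D positions; feats: (N, C) per-point features.
    Returns (n_voxels, C), one row per occupied voxel.
    """
    idx = np.floor(points / voxel_size).astype(np.int64)   # integer voxel coords
    _, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                           # voxel id per point
    n_vox = int(inverse.max()) + 1
    sums = np.zeros((n_vox, feats.shape[1]))
    np.add.at(sums, inverse, feats)                         # scatter-add features
    counts = np.bincount(inverse, minlength=n_vox)[:, None]
    return sums / counts

def mean_std_style_loss(scene_feats, style_feats, eps=1e-5):
    """Squared distance between channel-wise mean and std (assumed statistics)."""
    mu_s, mu_t = scene_feats.mean(axis=0), style_feats.mean(axis=0)
    sd_s = np.sqrt(scene_feats.var(axis=0) + eps)
    sd_t = np.sqrt(style_feats.var(axis=0) + eps)
    return float(np.mean((mu_s - mu_t) ** 2) + np.mean((sd_s - sd_t) ** 2))

# Toy example: the first two points share a voxel, the third sits in another.
points = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.1, 0.1]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [10.0, 2.0]])
pooled = voxel_pool(points, feats, voxel_size=1.0)
```

Pooling in 3D before computing statistics means that a surface seen from many views contributes once per voxel rather than once per view, which is the intuition behind aggregating in scene space instead of per image.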
