AnyStyle: Single-Pass Multimodal Stylization for 3D Gaussian Splatting

Joanna Kaleta, Bartosz Świrta, Kacper Kania, Przemysław Spurek, Marek Kowalski

TL;DR

AnyStyle tackles fast, pose-free stylization of 3D scenes represented by 3D Gaussian Splatting by introducing a modular, architecture-agnostic Style Branch that couples with a frozen AnySplat backbone. Style control is multimodal and zero-shot, driven by Long-CLIP embeddings that support both text and image inputs and allow smooth interpolation between styles. A zero-initialized Style Injection mechanism enables additive conditioning without retraining the backbone, preserving geometry while achieving expressive appearance changes; CLIP-based and perceptual losses guide stylization across views. Empirical evaluation shows state-of-the-art stylization quality among feed-forward methods, strong multi-view consistency, and clear benefits from text-conditioned control and style interpolation, enabling practical, flexible 3D stylization for rapid asset creation.
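The style interpolation mentioned above can be illustrated with a small sketch. The source does not state which interpolation scheme is used, so spherical linear interpolation (slerp) between unit-normalized embeddings is an assumption here, and the two-dimensional vectors are toy stand-ins for real Long-CLIP embeddings. The function names `normalize` and `slerp` are hypothetical and do not come from the AnyStyle repository.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(a, b, t):
    """Spherical interpolation between two embeddings, t in [0, 1].

    Assumption: embeddings are compared on the unit sphere, as is common
    for CLIP-style features. The paper may use a different scheme.
    """
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two embeddings
    if omega > math.pi - 1e-6:
        # Opposite vectors have no unique shortest arc between them.
        raise ValueError("embeddings are antiparallel; slerp is undefined")
    if omega < 1e-6:
        # Nearly identical directions: a linear blend is numerically safer.
        return normalize([(1 - t) * x + t * y for x, y in zip(a, b)])
    s = math.sin(omega)
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


if __name__ == "__main__":
    style_a = [1.0, 0.0]  # toy stand-in for a text-prompt embedding
    style_b = [0.0, 1.0]  # toy stand-in for a style-image embedding
    for step in range(5):
        t = step / 4
        print(t, slerp(style_a, style_b, t))
```

Because text and image inputs are mapped into the same embedding space, a blend of this kind could in principle mix a text prompt with a reference image; the blended vector would then be passed to the style branch in place of a single style embedding.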

Abstract

The growing demand for rapid and scalable 3D asset creation has driven interest in feed-forward 3D reconstruction methods, with 3D Gaussian Splatting (3DGS) emerging as an effective scene representation. While recent approaches have demonstrated pose-free reconstruction from unposed image collections, integrating stylization or appearance control into such pipelines remains underexplored. Existing attempts largely rely on image-based conditioning, which limits both controllability and flexibility. In this work, we introduce AnyStyle, a feed-forward 3D reconstruction and stylization framework that enables pose-free, zero-shot stylization through multimodal conditioning. Our method supports both textual and visual style inputs, allowing users to control the scene appearance using natural language descriptions or reference images. We propose a modular stylization architecture that requires only minimal architectural modifications and can be integrated into existing feed-forward 3D reconstruction backbones. Experiments demonstrate that AnyStyle improves style controllability over prior feed-forward stylization methods while preserving high-quality geometric reconstruction. A user study further confirms that AnyStyle achieves superior stylization quality compared to an existing state-of-the-art approach. Repository: https://github.com/joaxkal/AnyStyle.

Paper Structure

This paper contains 32 sections, 11 equations, 20 figures, and 6 tables.

Figures (20)

  • Figure 1: Teaser. Given a set of unposed input images and a style conditioning signal provided as either text or an image, our method generates a stylized 3D scene represented with 3D Gaussian Splats in a single forward pass. The reconstructed scene can be stylized in under 0.1 second per input content image.
  • Figure 2: Method overview. AnyStyle takes unposed content images of a scene together with an arbitrary style input (text or image) and produces a stylized 3D Gaussian representation from which novel stylized views can be rendered. The architecture follows a dual-branch design that decouples geometric reconstruction from appearance stylization. Content images are processed by a pretrained frozen backbone to recover geometry and camera poses, while the style signal is embedded using CLIP and applied only within the style branch to control scene appearance. A pretrained AnySplat model initializes the copied Aggregator $\tilde{A}$ and Gaussian Head $\tilde{H}_{gs}$ in the style branch. These components are subsequently fine-tuned with CLIP-based conditioning via style injection. The outputs of the two branches are combined by a Gaussian Adapter and rendered to produce the final stylized views.
  • Figure 3: Comparison between AnyStyle and existing 3D style transfer methods with different architectural designs: feed-forward (purple), per-scene optimization (green), and hybrid approaches (blue). Our method achieves high-quality style transfer while faithfully preserving fine details (top row) as well as overall scene structure. All compared methods are conditioned on a style image.
  • Figure 4: Stylization using text prompts. We compare AnyStyle with ClipGaussian [howil2025clipgaussian], which requires per-scene optimization (more than 20 minutes). Despite using identical text prompts, ClipGaussian introduces semantic artifacts from the style input.
  • Figure 5: Stylization with text prompts vs. images. We compare renderings conditioned either on a reference test style image or on a textual description of that image generated by Mini-CPM-V4.5. Our method achieves coherent and plausible stylization across both modalities. Because text prompts encode less information than images and natural language is inherently more ambiguous, text-based conditioning cannot exactly reproduce the appearance obtained with image-based conditioning.
  • ...and 15 more figures
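
The Figure 2 caption and the TL;DR describe a zero-initialized, additive style injection applied only in the style branch. The exact layer is not specified in this excerpt, so the sketch below assumes the simplest possible form: a linear projection of the style embedding whose weights start at zero and whose output is added to the branch features. The class name `ZeroInitStyleInjection` and all dimensions are hypothetical.

```python
class ZeroInitStyleInjection:
    """Additive conditioning: features + W @ style + b, with W and b zero at init.

    Assumption: a single linear projection. The actual AnyStyle layer may differ.
    """

    def __init__(self, style_dim, feature_dim):
        # Zero initialization means the layer is an identity on the features
        # until training moves the weights away from zero.
        self.weight = [[0.0] * style_dim for _ in range(feature_dim)]
        self.bias = [0.0] * feature_dim

    def inject(self, features, style):
        if len(features) != len(self.weight):
            raise ValueError("feature dimension mismatch")
        if len(style) != len(self.weight[0]):
            raise ValueError("style dimension mismatch")
        # Project the style embedding into feature space.
        offset = [
            sum(w * s for w, s in zip(row, style)) + b
            for row, b in zip(self.weight, self.bias)
        ]
        # Add the projected style to the unchanged branch features.
        return [f + o for f, o in zip(features, offset)]


if __name__ == "__main__":
    layer = ZeroInitStyleInjection(style_dim=3, feature_dim=2)
    features = [0.5, -1.0]      # toy stand-in for style-branch features
    style = [0.2, 0.4, 0.6]     # toy stand-in for a CLIP style embedding
    print(layer.inject(features, style))  # identical to features at init
```

With this kind of initialization the style branch starts out reproducing the features of the pretrained AnySplat copy it was initialized from, which is consistent with the stated goal of preserving geometry while appearance control is learned during fine-tuning.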