Style3D: Attention-guided Multi-view Style Transfer for 3D Object Generation

Bingjie Song, Xin Huang, Ruting Xie, Xue Wang, Qing Wang

TL;DR

Style3D tackles the problem of instantly stylizing 3D objects from a content-style image pair without style-specific training. It decomposes the task into two stages: Multi-View Dual-Feature Alignment, which uses a MultiFusion Attention mechanism to anchor geometry with content queries while injecting style via keys/values; and Sparse-view Spatial Reconstruction, which reconstructs a coherent 3D object from stylized multi-view features using a triplane/SDF-based representation and FlexiCubes mesh extraction. The method achieves high stylistic fidelity and geometric coherence across views, outperforming baselines in realism, coherence, and CLIP-based alignment, while offering significantly faster generation (about 30 seconds per object). These results demonstrate the practical potential for rapid, scalable creation of style-consistent 3D assets in design, gaming, and VR applications, without the heavy retraining typical of prior approaches.

Abstract

We present Style3D, a novel approach for generating stylized 3D objects from a content image and a style image. Unlike most previous methods that require case- or style-specific training, Style3D supports instant 3D object stylization. Our key insight is that 3D object stylization can be decomposed into two interconnected processes: multi-view dual-feature alignment and sparse-view spatial reconstruction. We introduce MultiFusion Attention, an attention-guided technique to achieve multi-view stylization from the content-style pair. Specifically, the query features from the content image preserve geometric consistency across multiple views, while the key and value features from the style image are used to guide the stylistic transfer. This dual-feature alignment ensures that spatial coherence and stylistic fidelity are maintained across multi-view images. Finally, a large 3D reconstruction model is introduced to generate coherent stylized 3D objects. By establishing an interplay between structural and stylistic features across multiple views, our approach enables a holistic 3D stylization process. Extensive experiments demonstrate that Style3D offers a more flexible and scalable solution for generating style-consistent 3D assets, surpassing existing methods in both computational efficiency and visual quality.
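The query/key-value split described in the abstract can be illustrated with plain scaled dot-product cross-attention. This is a minimal sketch, not the paper's MultiFusion Attention: the real mechanism operates on features inside a multi-view diffusion model, while the function name `cross_attention` and the toy list-of-lists feature layout here are assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(content_q, style_k, style_v):
    """Generic scaled dot-product cross-attention (illustrative only).

    content_q: one query vector per content token (drives spatial layout).
    style_k, style_v: one key and one value vector per style token.
    Returns one output vector per content token, each a weighted
    average of the style value vectors.
    """
    d = len(content_q[0])
    out = []
    for q in content_q:
        # Similarity of this content query to every style key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in style_k]
        weights = softmax(scores)
        # Blend style values using the attention weights.
        out.append([sum(w * v[j] for w, v in zip(weights, style_v))
                    for j in range(len(style_v[0]))])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three content tokens
    k = [[1.0, 0.0], [0.0, 1.0]]               # two style tokens
    v = [[0.2, 0.8], [0.9, 0.1]]
    for row in cross_attention(q, k, v):
        print(row)
```

The output has one row per content query, which is why the content side fixes the spatial arrangement, while every output vector is built only from style values, which is where the appearance comes from.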

Paper Structure

This paper contains 20 sections, 5 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Method overview. Given two input images, one serving as target content and the other as target style, we first perform Multi-view Dual-feature Alignment in the first stage. This involves extracting content features and style features in the multi-view diffusion process, which separately anchor the geometric and stylistic characteristics. These features are then fused using an attention mechanism to generate multiple stylized views of the object. In the second stage, we leverage Sparse-view Spatial Reconstruction, where the generated multi-view images are passed through a feature encoder and decoded into a 3D object. The decoder integrates spatial and stylistic features across a triplane representation to produce a coherent 3D mesh. The entire process integrates style and geometry while maintaining high computational efficiency, resulting in a stylized 3D object as the final output.
  • Figure 2: MultiFusion attention mechanism. This is designed to align two distinct feature sets to maintain spatial and semantic coherence, by anchoring content-derived query features for geometric consistency across views and infusing style-derived key-value features for high-dimensional texture details.
  • Figure 3: The procedure of the stylized 3D Reconstruction. Multi-view images generated in the first stage are used to reconstruct a high-quality 3D object, leveraging SDF-based implicit representations to accurately encode 3D geometry and ensure smooth, flexible surface definitions.
  • Figure 4: Qualitative comparison on texture generation. Style3D is compared with SOTA texture generation methods (TEXTure, Richardson et al., 2023; Paint-it, Youwang et al., 2024). Given the mesh and the corresponding prompt as used in baseline methods, we render a front-view image as the content image and pair it with an image generated from the prompt as the input. Style3D demonstrates comparable performance in generating consistent and visually natural 3D textured objects.
  • Figure 5: Qualitative comparison with generation methods on stylization. The figure illustrates the superior ability of Style3D to adapt and transfer style characteristics, with a content image rendered from the input mesh and a style image generated by the prompt as input, addressing the gaps of existing methods.
  • ...and 8 more figures
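
The triplane representation mentioned in the Figure 1 caption can be sketched as follows. This shows the common triplane lookup pattern (project a 3D point onto three axis-aligned feature planes and combine the sampled features), not the paper's decoder: the names `to_index` and `sample_triplane`, the nearest-neighbor sampling, and the summation rule are all assumptions chosen to keep the example short. Real systems typically use bilinear interpolation and feed the combined feature to a small network that predicts SDF and color values.

```python
def to_index(coord, res):
    """Map a coordinate in [-1, 1] to the nearest grid index in [0, res - 1]."""
    clamped = max(-1.0, min(1.0, coord))
    return round((clamped + 1.0) / 2.0 * (res - 1))


def sample_triplane(planes, point):
    """Look up a 3D point in a triplane (illustrative, nearest-neighbor).

    planes: dict with keys 'xy', 'xz', 'yz'; each value is a
            res x res grid of feature vectors (nested lists).
    point:  (x, y, z) with each coordinate in [-1, 1].
    Returns the element-wise sum of the three sampled feature vectors.
    """
    x, y, z = point
    res = len(planes["xy"])
    i, j, k = to_index(x, res), to_index(y, res), to_index(z, res)
    feats = [
        planes["xy"][i][j],  # projection that drops z
        planes["xz"][i][k],  # projection that drops y
        planes["yz"][j][k],  # projection that drops x
    ]
    return [sum(channel) for channel in zip(*feats)]


if __name__ == "__main__":
    res = 2
    planes = {
        "xy": [[[1.0] for _ in range(res)] for _ in range(res)],
        "xz": [[[10.0] for _ in range(res)] for _ in range(res)],
        "yz": [[[100.0] for _ in range(res)] for _ in range(res)],
    }
    print(sample_triplane(planes, (0.3, -0.7, 1.0)))
```

Storing three 2D planes instead of a dense 3D grid keeps memory roughly quadratic rather than cubic in resolution, which is one reason this layout is popular in feed-forward reconstruction models.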