Jigsaw3D: Disentangled 3D Style Transfer via Patch Shuffling and Masking

Yuteng Ye, Zheng Zhang, Qinchuan Zhang, Di Wang, Youjia Zhang, Wenxiao Zhang, Wei Yang, Yuan Liu

TL;DR

Jigsaw3D tackles the challenge of transferring 2D stylistic cues to 3D textures while maintaining multi-view consistency and geometric fidelity. It introduces a jigsaw-based style-reference construction to disentangle style from content, enabling supervised training of a multi-view diffusion model that uses geometry cues and reference-to-view cross-attention to instill style consistently across views. The method includes a 3D style baking step to fuse stylized views into a seamless UV texture, and it demonstrates strong style fidelity, cross-view coherence, and versatility across partial stylization, multi-object scenes, and tileable textures. This approach offers scalable, fast 3D stylization without per-asset optimization and broad applicability to practical content creation workflows.
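
As a rough illustration of reference-to-view cross-attention, the sketch below lets tokens from each generated view act as queries while features of the style reference supply the keys and values, as the Figure 2 caption describes. It is a single-head NumPy toy, not the paper's U-Net module: the function name `reference_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all tensor shapes are assumptions made for illustration.

```python
import numpy as np

def reference_attention(view_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head cross-attention from view tokens to reference tokens.

    view_tokens: (V, T, D) tokens for V views with T tokens each (hypothetical layout).
    ref_tokens:  (R, D) tokens extracted from the style reference.
    w_q, w_k, w_v: (D, d) projection matrices (assumed, normally learned).
    """
    q = view_tokens @ w_q                      # queries come from the views: (V, T, d)
    k = ref_tokens @ w_k                       # keys come from the reference: (R, d)
    v = ref_tokens @ w_v                       # values come from the reference: (R, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot products: (V, T, R)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                         # reference-informed view tokens: (V, T, d)

# Toy usage with random tensors: 4 views, 10 tokens per view, 6 reference tokens.
rng = np.random.default_rng(0)
views = rng.standard_normal((4, 10, 16))
reference = rng.standard_normal((6, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 8)) for _ in range(3))
styled = reference_attention(views, reference, w_q, w_k, w_v)
```

Because every view attends to the same reference tokens, all views draw on one shared pool of style features; in the described pipeline a separate multi-view attention handles cross-view consistency, which this toy does not model.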

Abstract

Controllable 3D style transfer seeks to restyle a 3D asset so that its textures match a reference image while preserving the asset's integrity and multi-view consistency. Prevalent methods rely either on direct injection of reference style tokens or on score distillation from 2D diffusion models, which incurs heavy per-scene optimization and often entangles style with semantic content. We introduce Jigsaw3D, a multi-view-diffusion-based pipeline that decouples style from content and enables fast, view-consistent stylization. Our key idea is to leverage the jigsaw operation (spatial shuffling and random masking of reference patches) to suppress object semantics and isolate stylistic statistics (color palettes, strokes, textures). We integrate these style cues into a multi-view diffusion model via reference-to-view cross-attention, producing view-consistent stylized renderings conditioned on the input mesh. The renders are then style-baked onto the surface to yield seamless textures. Across standard 3D stylization benchmarks, Jigsaw3D achieves high style fidelity and multi-view consistency with substantially lower latency, and it generalizes to masked partial reference stylization, multi-object scene styling, and tileable texture generation. The project page is available at: https://babahui.github.io/jigsaw3D.github.io/
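
The jigsaw operation described above (spatial shuffling plus random masking of reference patches) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `jigsaw`, the default `mask_ratio`, and filling masked patches with zeros are all hypothetical choices; only the idea of an $N \times N$ grid with shuffling and masking comes from the text.

```python
import numpy as np

def jigsaw(image, n=8, mask_ratio=0.5, rng=None):
    """Shuffle an (H, W, C) image as an n x n patch grid, then mask random patches.

    n is the number of divisions per image side; mask_ratio and the zero fill
    value are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    if h % n or w % n:
        raise ValueError("image height and width must be divisible by n")
    ph, pw = h // n, w // n

    # Cut the image into n*n patches of shape (ph, pw, c).
    patches = (image.reshape(n, ph, n, pw, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(n * n, ph, pw, c))

    # Spatial shuffling: a random permutation destroys the global layout.
    # Fancy indexing returns a copy, so the input image is left untouched.
    patches = patches[rng.permutation(n * n)]

    # Random masking: blank out a fixed fraction of the shuffled patches.
    num_masked = int(round(mask_ratio * n * n))
    masked = rng.choice(n * n, size=num_masked, replace=False)
    patches[masked] = 0

    # Reassemble the patch grid into a full image.
    return (patches.reshape(n, n, ph, pw, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(h, w, c))

# Toy usage on a random 256 x 256 RGB array standing in for a reference image.
reference = np.random.default_rng(0).random((256, 256, 3))
style_input = jigsaw(reference, n=8, mask_ratio=0.5, rng=np.random.default_rng(1))
```

Local patch content (colors, strokes, texture) survives the operation, while the arrangement that conveys object identity does not, which is the intuition behind using the result as a style-only reference.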

Paper Structure

This paper contains 18 sections, 14 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: We propose Jigsaw3D, a versatile 3D stylization framework that transfers stylistic statistics from 2D images to 3D meshes. Our method achieves high stylistic consistency across multiple, diverse objects in a scene (top). Furthermore, it demonstrates high versatility with various art styles and supports partial reference stylization for fine-grained user control (bottom).
  • Figure 2: Our Method Pipeline. The whole framework contains multi-view stylized image generation and 3D style baking. Multi-View Style Generation: position and normal maps from the mesh $M$ are encoded and injected into a style U-Net via feature modulation, while the reference image $I$ is processed by a jigsaw operation involving image patch shuffling and random masking to extract style features. These style features are sent to a pre-trained reference U-Net to extract intermediate features that serve as keys and values in a reference attention module. Our style U-Net uses reference attention for aligning with the reference style and multi-view attention to ensure cross-view consistency. 3D Style Baking: The generated multi-view images are projected onto the mesh's UV space, yielding a seamless UV map ready for final rendering.
  • Figure 3: Analysis of style-content disentanglement through patch shuffling and masking. We apply different degrees of shuffling and a fixed mask ratio. Left: Quantitative evaluation of content and style attributes under increasing shuffle intensity. As $N$ (number of divisions per image side) increases, the CNN-based classification score (blue line) of shuffled images decreases sharply. At $N=8$, semantic content is almost entirely lost. Meanwhile, the Gram matrix similarity (Gatys et al., 2016; green dashed line) calculated between shuffled images and source images increases gradually for $N \leq 8$, indicating well-preserved style fidelity. The setting $N=8$ strikes a good balance between semantic suppression and style preservation. Right: Visual examples of shuffled images using different values of $N$ and a fixed mask ratio.
  • Figure 4: Qualitative comparison between 3D stylization methods on our collected dataset and WikiArt. The left side of the dashed line displays the input object mesh and reference image. On the right, four groups of comparative results are shown, and each group has two selected viewpoints.
  • Figure 5: Ablation study on the Jigsaw module. The left side shows the input object mesh and reference style image. The right side presents groups of stylization results under different Jigsaw settings: (a) w/o Train & Infer Jigsaw: training and reference process without jigsaw operation; (b) w/o Infer Jigsaw: only inference process without jigsaw operation; (c) w/ Train & Infer Jigsaw (Ours): our approach applies the jigsaw operation in both training and inference phases.
  • ...and 11 more figures
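
The Gram matrix similarity mentioned in the Figure 3 caption can be illustrated with a small NumPy sketch. Several things here are assumptions rather than details from the paper: the function names `gram_matrix` and `gram_similarity`, the use of cosine similarity between flattened Gram matrices, and the random array standing in for CNN feature maps (in practice the features would come from a pre-trained network).

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: channel-by-channel correlations."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)   # (C, C); spatial positions are summed out

def gram_similarity(feat_a, feat_b):
    """Cosine similarity between flattened Gram matrices (an assumed metric)."""
    ga = gram_matrix(feat_a).ravel()
    gb = gram_matrix(feat_b).ravel()
    return float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb) + 1e-12))

# Toy check: permuting the spatial positions of a feature map leaves its
# Gram matrix unchanged, because the sum over positions ignores their order.
rng = np.random.default_rng(0)
feats = rng.random((8, 6, 6))                      # stand-in for CNN features
perm = rng.permutation(6 * 6)
shuffled = feats.reshape(8, -1)[:, perm].reshape(8, 6, 6)
similarity = gram_similarity(feats, shuffled)
```

The invariance is exact only when the feature map itself is permuted. Shuffling image patches before feature extraction changes features near patch borders, so the similarity measured that way is typically high but not perfect.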