RegionRoute: Regional Style Transfer with Diffusion Model

Bowen Chen, Jake Zuena, Alan C. Bovik, Divya Kothandaraman

TL;DR

RegionRoute is an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. It produces regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.

Abstract

Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
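The two attention objectives named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything here is an assumption: the function names `focus_loss` and `cover_loss`, the direction of the KL term, the sum-normalization of the attention map, and the max-normalization used before the binary cross-entropy are illustrative choices, not the authors' implementation.

```python
import numpy as np

EPS = 1e-8  # numerical guard; value chosen for illustration


def focus_loss(attn, mask):
    """KL(mask distribution || attention distribution) over image tokens.

    Assumption: both the style-token attention map and the binary mask are
    normalized to sum to 1, so the loss is small when attention mass sits
    inside the masked region.
    """
    attn = np.asarray(attn, dtype=np.float64).ravel()
    mask = np.asarray(mask, dtype=np.float64).ravel()
    p = attn / (attn.sum() + EPS)   # attention as a distribution
    q = mask / (mask.sum() + EPS)   # uniform distribution over the mask
    inside = q > 0                  # terms with q == 0 contribute nothing
    return float(np.sum(q[inside] * (np.log(q[inside]) - np.log(p[inside] + EPS))))


def cover_loss(attn, mask):
    """Binary cross-entropy between rescaled attention and the binary mask.

    Assumption: attention is divided by its maximum so each token gets a
    value in [0, 1] that can be compared with the mask per token.
    """
    attn = np.asarray(attn, dtype=np.float64).ravel()
    mask = np.asarray(mask, dtype=np.float64).ravel()
    a = np.clip(attn / (attn.max() + EPS), EPS, 1.0 - EPS)
    return float(-np.mean(mask * np.log(a) + (1.0 - mask) * np.log(1.0 - a)))


# Toy example with four image tokens; the first two belong to the object.
mask = np.array([1, 1, 0, 0])
localized = np.array([0.5, 0.5, 0.0, 0.0])    # attention stays inside the mask
spilled = np.array([0.25, 0.25, 0.25, 0.25])  # attention leaks outside

print(focus_loss(localized, mask), focus_loss(spilled, mask))
print(cover_loss(localized, mask), cover_loss(spilled, mask))
```

In this toy case both losses are near zero for the localized map and clearly larger for the spilled one. In a training setup along the lines the abstract describes, such terms would be added to the main reconstruction objective with weighting coefficients, which are not specified here.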

Paper Structure (33 sections, 7 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: When provided with region-specific editing instructions, our RegionRoute framework more precisely interprets localized style modification prompts and produces visually coherent results. Given prompts such as “Make the man in pixel-art style and keep other areas unchanged,” the baseline image editing model tends to either apply the style globally or distort unrelated regions. Each column shows, from top to bottom: the input context image, the baseline Flux.1-Kontext output, and our RegionRoute output.
  • Figure 2: Overview of the proposed framework. The upper figure illustrates the overall pipeline based on the pretrained Flux.1-Kontext. Given a context image, a noisy input, and a regional style prompt, image and text tokens are processed through Flux.1-Kontext, where LoRA-MoE modules adapt attention and projection layers for style-specific learning. The model is optimized with flow matching, focus loss, and cover loss to reconstruct the target stylized image. Style-related attention maps are guided by binary masks, where focus loss concentrates attention within the target area and cover loss ensures spatial coverage for precise localized stylization.
  • Figure 3: Qualitative comparison of state-of-the-art instruction-based image editing methods.
  • Figure 4: Visualization of attention maps under different loss configurations. Using only $\mathcal{L}_{\text{cover}}$ or $\mathcal{L}_{\text{focus}}$ causes attention to spill over into nearby areas, whereas our full objective focuses on the motorcycle without leakage to surrounding regions, demonstrating its ability to maintain precise and consistent attention.
  • Figure 5: Illustration of the pseudo ground-truth (pseudo-GT) generation process. A diffusion-based style transfer model generates a fully stylized version of the image according to a given style prompt. The stylized region corresponding to the mask is then blended back into the original image using seamless cloning, producing an aligned input–target pair for localized style learning.
  • ...and 6 more figures
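
The pseudo-GT construction described in the Figure 5 caption can be sketched in a few lines. Note the substitution: the caption specifies seamless cloning, whereas this sketch uses a plain hard-mask composite to stay dependency-free, so it will not reproduce the smooth boundaries of the authors' pipeline. The function names `composite_pseudo_gt` and `unedited_l1` are hypothetical, and the pixel-difference check is only a simple stand-in for the pixel-level consistency on unedited areas mentioned in the abstract.

```python
import numpy as np


def composite_pseudo_gt(original, stylized, mask):
    """Paste stylized pixels into the original wherever mask == 1.

    Assumption: a hard composite replaces the seamless cloning step named
    in the caption. Images are (H, W, C) arrays, mask is (H, W) binary.
    """
    original = np.asarray(original, dtype=np.float32)
    stylized = np.asarray(stylized, dtype=np.float32)
    m = np.asarray(mask, dtype=np.float32)[..., None]  # broadcast over channels
    return m * stylized + (1.0 - m) * original


def unedited_l1(image_a, image_b, mask):
    """Mean absolute pixel difference over the region outside the mask."""
    keep = np.asarray(mask) == 0
    diff = np.abs(np.asarray(image_a, np.float32) - np.asarray(image_b, np.float32))
    return float(diff[keep].mean())


# Toy 4x4 RGB example: the "stylized" image is all ones, the original all zeros.
original = np.zeros((4, 4, 3), dtype=np.float32)
stylized = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1  # the object occupies the central 2x2 block

target = composite_pseudo_gt(original, stylized, mask)
print(target[1, 1], target[0, 0])
print(unedited_l1(original, target, mask))
```

With a hard composite the unedited region is pixel-identical to the input, which is the property an aligned input–target pair needs. A gradient-domain blend such as seamless cloning trades some of that exactness near the boundary for a less visible seam.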