Attention Distillation: A Unified Approach to Visual Characteristics Transfer

Yang Zhou, Xu Gao, Zichong Chen, Hui Huang

TL;DR

This work tackles transferring visual characteristics from a reference image to new content within latent-diffusion models. It introduces Attention Distillation (AD), a loss that aligns self-attention-based representations between target and reference via $\mathcal{L}_{AD}$ and complements it with a content loss, enabling both optimization-based and sampling-based (AD-guided) synthesis. The approach is extended with an improved VAE decoding strategy and a diffusion-sampling mechanism that uses gradient-based AD guidance, achieving style transfer, appearance transfer, texture synthesis, and style-conditioned text-to-image generation with broad compatibility (e.g., ControlNet). Empirical results across multiple tasks show superior fidelity to references, better structural preservation, and accelerated synthesis compared to state-of-the-art baselines, making AD a flexible, unified framework for example-based image synthesis.

Abstract

Recent advances in generative diffusion models have shown a notable inherent understanding of image style and semantics. In this paper, we leverage the self-attention features from pretrained diffusion networks to transfer the visual characteristics from a reference to generated images. Unlike previous work that uses these features as plug-and-play attributes, we propose a novel attention distillation loss calculated between the ideal and current stylization results, based on which we optimize the synthesized image via backpropagation in latent space. Next, we propose an improved Classifier Guidance that integrates attention distillation loss into the denoising sampling process, further accelerating the synthesis and enabling a broad range of image generation applications. Extensive experiments have demonstrated the extraordinary performance of our approach in transferring the examples' style, appearance, and texture to new images in synthesis. Code is available at https://github.com/xugao97/AttentionDistillation.

Paper Structure

This paper contains 32 sections, 12 equations, 26 figures, 1 table.

Figures (26)

  • Figure 1: Given a reference image, our approach can faithfully reproduce its visual characteristics in synthesis, providing a unified framework for a wide range of example-based image synthesis applications, such as artistic style transfer, appearance transfer, style-specific text-to-image generation, and various texture synthesis tasks.
  • Figure 2: Overview of attention distillation. Based on the self-attention mechanism in diffusion models, we compute the difference between the ideal and the current stylization, formulating a novel Attention Distillation (AD) loss (a). The new loss acts like a style loss. When combined with a content loss (also derived from the self-attention mechanism), we can realize high-quality content-preserving synthesis, such as style transfer or appearance transfer (b). Our attention distillation loss can be incorporated into the normal diffusion sampling process as an improved Classifier Guidance (c), which enables a broad scope of example-based image generation applications.
  • Figure 3: Differences between KV-injection and attention distillation. We start with the same latent for sampling and optimization, both running 100 steps, using empty prompts. The information flow (red arrows) differs only in the identity connection. However, the results of our attention distillation optimization (b) are clearly superior to sampling with KV-injection (a).
  • Figure 4: Optimizing attention distillation loss across multiple runs. The coherence in texture and style, and the variations in structure across multiple runs of the same reference, demonstrate the ability of our AD loss in style alignment and spatial adaptation.
  • Figure 5: Improved VAE decoding. The pretrained VAE is lossy in high-frequency details. Fine-tuning the VAE with the reference image over several steps (denoted as VAE*) can enhance the reconstruction quality and the decoding for novel image synthesis.
  • ...and 21 more figures
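
The Figure 2 caption describes the AD loss as the difference between the "ideal" and the "current" stylization, both derived from self-attention. The sketch below is one possible reading of that description, not the authors' implementation. It assumes the current stylization is the target's own self-attention output, the ideal stylization is the target's queries attending to the reference's keys and values, and the difference is a mean absolute (L1) distance. All function names, shapes, and the random features are hypothetical stand-ins for features that would come from a pretrained diffusion network.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(q, k, v):
    """Scaled dot-product attention for a single head.

    q: (n_query, dim), k: (n_key, dim), v: (n_key, dim).
    """
    dim = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(dim), axis=-1)
    return weights @ v


def attention_distillation_loss(q, k, v, k_ref, v_ref):
    """Hypothetical AD loss: L1 gap between current and ideal stylization.

    Assumption: 'current' uses the target's own keys/values, while
    'ideal' re-aggregates the reference's values using target queries.
    """
    current = self_attention(q, k, v)
    ideal = self_attention(q, k_ref, v_ref)
    return float(np.abs(current - ideal).mean())


# Toy features standing in for self-attention projections (assumed shapes).
rng = np.random.default_rng(0)
n_tokens, n_ref_tokens, dim = 16, 24, 8
q = rng.normal(size=(n_tokens, dim))
k = rng.normal(size=(n_tokens, dim))
v = rng.normal(size=(n_tokens, dim))
k_ref = rng.normal(size=(n_ref_tokens, dim))
v_ref = rng.normal(size=(n_ref_tokens, dim))

loss = attention_distillation_loss(q, k, v, k_ref, v_ref)

# Sanity check: if the target already uses the reference keys/values,
# current and ideal coincide and the loss vanishes.
loss_identical = attention_distillation_loss(q, k_ref, v_ref, k_ref, v_ref)
```

In the optimization-based setting the abstract describes, a loss of this kind would be backpropagated to the latent through the network that produces `q`, `k`, and `v`, which requires an autodiff framework rather than plain NumPy. The sketch only illustrates how the two attention outputs are compared.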