Precise Parameter Localization for Textual Generation in Diffusion Models

Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic

TL;DR

This work tackles the challenge of understanding and controlling textual content in images generated by diffusion models by localizing a tiny subset of cross- and joint-attention parameters. It introduces an activation-patching method that identifies the small fraction of parameters responsible for text generation ($0.61\%$ for SDXL, $0.21\%$ for DeepFloyd IF, and $0.23\%$ for SD3) across diverse architectures and encoders. The authors demonstrate three practical benefits: targeted LoRA fine-tuning on the localized layers improves text quality without sacrificing visual fidelity, patching enables precise editing of text within images, and on-the-fly text substitution can prevent toxic text while preserving the rest of the image. The approach is architecture-agnostic and compatible with multiple text encoders (e.g., CLIP, T5), offering a scalable path to safer, more efficient text-to-image generation. These findings advance both the interpretability and practical control of diffusion-based text rendering in images, with immediate implications for editing, safety, and customization.

Abstract

Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of their generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM, SDXL, and DeepFloyd IF) and transformer-based (e.g., Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to large language models like T5). Project page available at https://t2i-text-loc.github.io/.

Paper Structure

This paper contains 35 sections, 16 figures, 7 tables, 1 algorithm.

Figures (16)

  • Figure 1: Overview of the localization process. Our goal is to edit the image generated from the source prompt $p_S$ using the target prompt $p_T$. To find which cross and joint attention layers should be modified, we pass the target prompt $p_T$ through the DM, caching the keys and values. Then, while generating the image from $p_S$ we substitute the keys and values with the cached ones. We select the layers which yield the highest image and text alignment. (A) Localizing by Patching is applied to SD3, and (B) Localizing by Injection is used for SDXL and DeepFloyd IF.
  • Figure 2: Localized attention layers responsible for the content of the generated text. We selectively patch individual cross and joint attention layers with computations for the target prompt and measure the responses with OCR F1 Score. We identify three layers with the highest responses in SDXL (55, 56, and 57), one layer in DeepFloyd IF (17), and one layer in SD3 (10).
  • Figure 3: The localized layers effectively balance the text alignment with the target prompt $p_T$ and the image alignment with the source prompt $p_S$. For ease of exposition, we measure the text alignment with OCR F1 and the image alignment with SSIM. We observe that injecting the target prompt $p_T$ into too many layers decreases the image alignment and introduces undesirable artifacts, e.g., the Japanese text on the robot's chest in the 2nd image from the right and the lack of fish in the 1st image from the right. Conversely, injecting $p_T$ into too few layers does not edit the generated text. We present more details about the experiment in the appendix.
  • Figure 4: Patching preserves visual components from the source prompt, taking only the textual information from the injected target prompt. In all the combinations of templates and texts that we inject into the localized layers of diffusion models (with other layers receiving both the source template and the source text), the final visual components of the image are always closer to the original template, while the textual content is always aligned with that of the injected prompt. The source prompt is always defined as $p_S$=Template$_{S}$:Text$_{S}$, while we change the target prompts to Template$_{S}$:Text$_{S}$, Template$_{S}$:Text$_{T}$, and Template$_{T}$:Text$_{T}$ (from left to right for the images).
  • Figure 5: Fine-tuning LoRA on localized layers improves text generation quality without compromising overall generation capabilities. We apply LoRA fine-tuning to the SDXL model to enhance its text generation capabilities. (top left) LoRA fine-tuning on the localized layers converges to a higher quality of the generated text (as measured by the OCR F1 and CLIP-T metrics). (bottom left) When fine-tuning LoRA on all cross-attention layers (denoted as C-A), the model quickly collapses, losing its ability to generate examples that match the prompt. The diversity is significantly reduced, as indicated by recall. In contrast, fine-tuning LoRA only on our localized cross-attention layers prevents model overfitting while improving text generation quality. It preserves diversity while achieving higher fidelity, as measured by precision. (right) We also present this effect on sample generations. Longer LoRA fine-tuning (measured in epochs) on localized layers improves text quality while preserving visual content; in contrast, applying LoRA to all layers results in significant degradation of the image quality and diversity.
  • ...and 11 more figures
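
The localization procedure described in the captions of Figures 1 and 2 (cache the keys and values computed for the target prompt, substitute them in selected layers while generating from the source prompt, then keep the layers with the highest alignment) can be illustrated on a toy model. The snippet below is a minimal sketch under stated assumptions, not the authors' implementation: `ToyCrossAttention`, `generate`, and `top_layers` are hypothetical names, the "layers" are simple scalings rather than learned projections, and the alignment scores are made up for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single image-token query."""
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    )
    return [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(query))
    ]


class ToyCrossAttention:
    """Hypothetical stand-in for one cross-attention layer (not a real DM layer)."""

    def __init__(self, key_gain, value_gain):
        self.key_gain = key_gain
        self.value_gain = value_gain

    def keys_values(self, prompt_tokens):
        # Keys and values depend only on the prompt, as in cross-attention.
        keys = [[self.key_gain * x for x in tok] for tok in prompt_tokens]
        values = [[self.value_gain * x for x in tok] for tok in prompt_tokens]
        return keys, values


def generate(layers, state, prompt_tokens, cache=None, patched=frozenset()):
    """Run the toy model; layers whose index is in `patched` read K/V from `cache`."""
    recorded = {}
    for index, layer in enumerate(layers):
        keys, values = layer.keys_values(prompt_tokens)
        recorded[index] = (keys, values)  # what this prompt produces at this layer
        if cache is not None and index in patched:
            keys, values = cache[index]  # substitute the cached target K/V
        update = attend(state, keys, values)
        state = [s + u for s, u in zip(state, update)]  # residual connection
    return state, recorded


def top_layers(scores, count):
    """Return the `count` layer indices with the highest alignment score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:count])


layers = [
    ToyCrossAttention(0.5, 1.0),
    ToyCrossAttention(1.0, 0.5),
    ToyCrossAttention(2.0, 2.0),
]
start = [0.1, -0.2]                        # toy image state
source_prompt = [[1.0, 0.0], [0.0, 1.0]]   # toy embedding of p_S
target_prompt = [[0.0, 1.0], [1.0, 1.0]]   # toy embedding of p_T

# 1) Pass the target prompt through the model and cache keys/values per layer.
target_out, target_cache = generate(layers, start, target_prompt)
# 2) Generate from the source prompt, unpatched and with layer 1 patched.
source_out, _ = generate(layers, start, source_prompt)
patched_out, _ = generate(
    layers, start, source_prompt, cache=target_cache, patched={1}
)
# 3) Select layers by score. These scores are made up, not results from the paper.
example_scores = {0: 0.10, 1: 0.60, 2: 0.25}
selected = top_layers(example_scores, 1)
```

In a real pipeline the per-layer score would come from generated images (the captions mention OCR F1 for text alignment and SSIM for image alignment), and the substitution would typically be wired in through the model's attention modules rather than a hand-written loop.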