Precise Parameter Localization for Textual Generation in Diffusion Models
Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic
TL;DR
This work tackles the challenge of understanding and controlling textual content in images generated by diffusion models by localizing the small subset of cross- and joint-attention parameters responsible for it. It introduces an activation-patching method that identifies this subset ($0.61\%$ of parameters for SDXL, $0.21\%$ for DeepFloyd IF, and $0.23\%$ for SD3) across diverse architectures and text encoders. The authors demonstrate three practical benefits: LoRA fine-tuning restricted to the localized layers improves text quality without sacrificing visual fidelity, patching those layers enables precise editing of the text within an image, and on-the-fly text substitution can prevent toxic text while preserving the rest of the image. The approach applies to both U-Net and transformer-based architectures and is compatible with multiple text encoders (e.g., CLIP, T5), offering a scalable path to safer, more efficient text-to-image generation. These findings advance both the interpretability and the practical control of diffusion-based text rendering in images, with immediate implications for editing, safety, and customization.
Abstract
Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve the efficiency and performance of textual generation by targeting the cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. First, we show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of their generations. Second, we demonstrate how the localized layers can be used to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to large language models such as T5). Project page available at https://t2i-text-loc.github.io/.
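To make the idea of localization by activation patching concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes each attention layer can be modeled as a function that mixes a prompt-conditioning vector into a hidden state, and it uses a plain L1 distance as a stand-in for whatever text-match metric is applied to real generated images. All names (`make_layer`, `run`, `localize`, the layer weights) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

Vector = List[float]
Layer = Callable[[Vector, Vector], Vector]


def make_layer(weight: float) -> Layer:
    """Toy stand-in for an attention layer: adds weight * conditioning to the hidden state."""
    def layer(hidden: Vector, cond: Vector) -> Vector:
        return [h + weight * c for h, c in zip(hidden, cond)]
    return layer


def run(layers: List[Layer], hidden: Vector, source_cond: Vector,
        target_cond: Optional[Vector] = None,
        patched: Iterable[int] = ()) -> Vector:
    """Run all layers on the source conditioning; layers listed in `patched`
    receive the target conditioning instead (the activation patch)."""
    patched_set = set(patched)
    for idx, layer in enumerate(layers):
        use_target = target_cond is not None and idx in patched_set
        hidden = layer(hidden, target_cond if use_target else source_cond)
    return hidden


def distance(a: Vector, b: Vector) -> float:
    """L1 distance; an assumed placeholder for a text-match metric on images."""
    return sum(abs(x - y) for x, y in zip(a, b))


def localize(layers: List[Layer], hidden: Vector,
             source_cond: Vector, target_cond: Vector) -> Dict[int, float]:
    """Score each layer by the fraction of the source-to-target output gap
    that is closed when only that layer is patched."""
    target_out = run(layers, hidden, target_cond)
    base_gap = distance(run(layers, hidden, source_cond), target_out)
    if base_gap == 0:
        raise ValueError("source and target runs are identical; nothing to localize")
    scores = {}
    for idx in range(len(layers)):
        patched_out = run(layers, hidden, source_cond, target_cond, patched=[idx])
        scores[idx] = 1.0 - distance(patched_out, target_out) / base_gap
    return scores


# Hypothetical setup: four layers, of which only layer 2 reacts strongly to the prompt.
LAYERS = [make_layer(w) for w in (0.0, 0.05, 2.0, 0.05)]
HIDDEN = [0.0, 0.0]
SOURCE = [1.0, 0.0]   # stands in for the prompt with the original text
TARGET = [0.0, 1.0]   # stands in for the prompt with the replacement text

scores = localize(LAYERS, HIDDEN, SOURCE, TARGET)
best_layer = max(scores, key=scores.get)
print(best_layer, scores)
```

In this toy setting the layer with the largest score is the one whose patching alone moves the output closest to the target run, which mirrors the intuition that a small set of attention layers carries most of the influence on rendered text. In a real diffusion model the patch would be applied to attention activations during denoising, and the score would come from comparing the text in the generated image against the target text.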
