DreamReader: An Interpretability Toolkit for Text-to-Image Models

Nirmalendu Prakash, Narmeen Oozeer, Michael Lan, Luka Samkharadze, Phillip Howard, Roy Ka-Wei Lee, Dhruv Nathawani, Shivam Raval, Amirali Abdullah

Abstract

Despite the rapid adoption of text-to-image (T2I) diffusion models, causal and representation-level analysis remains fragmented and largely limited to isolated probing techniques. To address this gap, we introduce DreamReader: a unified framework that formalizes diffusion interpretability as composable representation operators spanning activation extraction, causal patching, structured ablations, and activation steering across modules and timesteps. DreamReader provides a model-agnostic abstraction layer enabling systematic analysis and intervention across diffusion architectures. Beyond consolidating existing methods, DreamReader introduces three novel intervention primitives for diffusion models: (1) representation fine-tuning (LoReFT) for subspace-constrained internal adaptation; (2) classifier-guided gradient steering using MLP probes trained on activations; and (3) component-level cross-model mapping for systematic study of transferability of representations across modalities. These mechanisms enable lightweight white-box interventions on T2I models, drawing inspiration from interpretability techniques developed for LLMs. We demonstrate DreamReader through controlled experiments that (i) perform activation stitching between two models, and (ii) apply LoReFT to steer multiple activation units, reliably injecting a target concept into the generated images. Experiments are specified declaratively and executed in controlled batched pipelines to enable reproducible large-scale analysis. Across multiple case studies, we show that techniques adapted from language model interpretability yield promising and controllable interventions in diffusion models. DreamReader is released as an open-source toolkit for advancing research on T2I interpretability.
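
To make the phrase "subspace-constrained internal adaptation" concrete, the following is a minimal sketch of the low-rank edit commonly written as Φ(h) = h + Rᵀ(Wh + b − Rh), where R has orthonormal rows. It is an illustration in plain Python on a toy vector, not DreamReader's implementation; the function names `matvec` and `loreft` and all numeric values are assumptions made for this example.

```python
# Illustrative sketch only: toy LoReFT-style edit on a single activation vector.
# Names (matvec, loreft) and values are hypothetical, not the toolkit's API.

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def loreft(h, R, W, b):
    """Return h + R^T (W h + b - R h).

    R: r x d projection with orthonormal rows (defines the edited subspace).
    W, b: r x d weights and r-dim bias giving the target subspace coordinates.
    """
    current = matvec(R, h)                                  # coordinates of h in the subspace
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]   # learned target coordinates
    delta = [t - c for t, c in zip(target, current)]        # r-dimensional edit
    # Project the edit back to activation space (R^T delta).
    update = [sum(R[k][j] * delta[k] for k in range(len(R))) for j in range(len(h))]
    return [hj + uj for hj, uj in zip(h, update)]

# Toy setting: d = 3, rank r = 1, subspace is the first coordinate axis.
h = [1.0, 2.0, 3.0]
R = [[1.0, 0.0, 0.0]]
W = [[0.0, 0.0, 0.0]]
b = [5.0]
edited = loreft(h, R, W, b)
print(edited)
```

In this toy case only the coordinate spanned by R changes, while the remaining coordinates of the activation pass through untouched, which is the sense in which the edit is confined to a subspace.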

Paper Structure (18 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: DreamReader provides a unified abstraction of the diffusion stack, exposing four core interpretability operators (Localization, Steering, Stitching, and Sparse Autoencoders (SAEs)) that can be composed within a shared interface. The figure shows the four steps of an analysis, using Steering as the example.
  • Figure 2: LoReFT steering results on sample prompts for SDXL-Turbo. We steer the model to add spectacles by training a LoReFT module on cross-attention activations from different U-Net regions (down, mid, and up blocks). Steering effectiveness varies by prompt: for Simba, down-block steering succeeds, whereas for Jack Sparrow, up-block steering performs best. CLIP score measures prompt-image alignment; FID and LPIPS quantify the deviation from the baseline (unsteered) output. Results are logged and visualized with W&B.
  • Figure 3: Steering. We use CAA to extract a steering direction from a fine-tuned SD 1.5 model using a minimal contrastive setup with two prompts (e.g., "a photo of a Black man" vs. "a photo of a man"). We map this direction to the base model with our learned mapper and apply it during generation. The figure compares baseline samples (left) with the corresponding steered samples (right) for two example prompts.
  • Figure 4: Stitching workflow (code snippet). Example usage of the DreamReaderStitcher to train an MLPMapper that maps source-model activations to a target activation space. The snippet instantiates the workflow and mapper (1), specifies optimization, precision, and logging via TrainingSpec (2), and executes run_trainer() to obtain the trained mapper from output.preds (3).
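
The steering setup described for Figure 3 (a contrastive direction extracted in one model, mapped into another, then added during generation) can be sketched in a few lines of plain Python. This is a hedged illustration under simplifying assumptions: activations are toy 2-dimensional vectors, the learned mapper is replaced by a fixed linear matrix `M` standing in for the paper's MLPMapper, and the helper names and the scale `alpha` are hypothetical rather than part of the DreamReader API.

```python
# Illustrative sketch only: contrastive activation addition (CAA) with a
# stand-in linear mapper. All names and numbers here are hypothetical.

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def caa_direction(positive, negative):
    """Steering direction: mean positive activation minus mean negative activation."""
    return [p - q for p, q in zip(mean(positive), mean(negative))]

def apply_mapper(M, v):
    """Map a source-model vector into the target model's activation space."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def steer(h, direction, alpha):
    """Add the scaled direction to a target-model activation."""
    return [x + alpha * d for x, d in zip(h, direction)]

# Toy activations from the source model for the two contrastive prompts.
positive = [[1.0, 2.0], [3.0, 2.0]]
negative = [[0.0, 2.0], [2.0, 2.0]]
direction = caa_direction(positive, negative)

# Stand-in for a trained mapper: here it simply swaps the two axes.
M = [[0.0, 1.0], [1.0, 0.0]]
mapped = apply_mapper(M, direction)

steered = steer([0.5, 0.5], mapped, alpha=2.0)
print(direction, mapped, steered)
```

In a real pipeline the mapper would be trained on paired source and target activations, and the addition would be applied at chosen modules and timesteps; the sketch only shows the order of operations.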