Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion

Sathwik Karnik; Juyeop Kim; Sanmi Koyejo; Jong-Seok Lee; Somil Bansal

TL;DR

We introduce Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS achieves a superior Pareto frontier between generation diversity, quality, and alignment compared to state-of-the-art baselines.

Abstract

Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube", that is, the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: https://s-karnik.github.io/rads-memorization-project-page/.
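
To make the "backward reachable tube" idea concrete, the following is a minimal toy sketch, not the method used in the paper. It assumes a hypothetical 1-D controlled system with an attractor at 0 standing in for a memorized sample, a small bounded control standing in for the embedding perturbation, and a grid-based finite-horizon safety backup $V_t(x) = \min(\ell(x), \max_u V_{t+1}(f(x,u)))$. All names, dynamics, and constants here (`step`, `margin`, `0.8`, `0.15`, `0.3`) are illustrative assumptions.

```python
import numpy as np

def step(x, u):
    # Hypothetical dynamics: a local pull toward the "memorized" point 0, plus control u.
    return x - 0.8 * x * np.exp(-x**2) + u

def margin(x):
    # Signed distance to the failure set {|x| < 0.3}; negative inside it.
    return np.abs(x) - 0.3

def safety_value(grid, controls, horizon):
    """Finite-horizon safety backup on a grid.

    Returns V_0; V_0(x) <= 0 means no control sequence keeps x out of the
    failure set within `horizon` steps, i.e. x lies in the tube.
    """
    value = margin(grid)  # terminal condition V_T = margin
    for _ in range(horizon):
        # The controller picks the control with the best future margin.
        best_next = np.max(
            [np.interp(step(grid, u), grid, value) for u in controls], axis=0
        )
        value = np.minimum(margin(grid), best_next)
    return value

grid = np.linspace(-3.0, 3.0, 601)
controls = [-0.15, 0.0, 0.15]  # assumed bounded control set
v0 = safety_value(grid, controls, horizon=50)

def is_in_tube(x):
    # Look up V_0 at x by linear interpolation on the grid.
    return bool(np.interp(x, grid, v0) <= 0.0)

print(is_in_tube(0.6), is_in_tube(2.0))
```

In this toy, a state such as 0.6 lies outside the failure set yet is pulled in regardless of the control, so it belongs to the tube, while a state such as 2.0 is far enough away that the control can keep it out. A steering policy would therefore need to act before the trajectory enters the tube.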

Paper Structure

This paper contains 43 sections, 9 equations, 12 figures, and 2 tables.

Introduction
Related Work
Preliminaries: Denoising Diffusion
Reachability-Aware Diffusion Steering
Denoising as a Controlled Dynamical System
Memorization as a Reachability Problem
Failure Set and Backward Reachable Tube.
Target Function and Worst-Case Evolution.
Safety Target Function.
Constrained Markov Decision Process
Constraint.
Reward Function.
Constrained RL Solution (Soft Actor-Critic)
Architecture.
Experiments
...and 28 more sections

Figures (12)

Figure 1: RADS outperforms prior work. Generated images for the prompt "Design Art Beautiful View of Paris Paris Eiffel Towerunder Red Sky Ultra Glossy Cityscape Circle Wall Art". (a) Memorized target image from the training set. (b) Mitigated results produced by prior methods; 1–4 correspond to [wen2024detecting], [ren2024crossattn], [hintersdorf2024nemo], and [jain2025attraction], respectively. (c) Mitigated result produced by RADS (ours).
Figure 2: Overview of RADS. This diagram illustrates a real example of how RADS prevents memorization in the Stable Diffusion v1.4 model, while the baseline (no mitigation) generates an image that closely resembles the training sample. RADS does so by modeling the "attraction basin" of memorization as a backward reachable tube (BRT) and learning a policy $\pi_{\phi}$ that steers the caption embedding inputs to the diffusion model using reachability-constrained reinforcement learning.
Figure 3: Memorization happens early. Images generated with text guidance enabled for only the first $k$ steps ($k \in \{0,1,2\}$). Just 2 steps of guidance (c) are sufficient to reproduce the memorized image (d). Caption: The No Limits Business Woman Podcast.
Figure 4: Pareto Frontier Analysis: Quality and Alignment vs. Replication. We compare RADS (ours) against various state-of-the-art mitigation methods on the [webster2023extraction] dataset. The top row shows Quality ($-\log_{10}(\mathrm{FID}) \uparrow$), while the bottom row shows Alignment (CLIP Score $\uparrow$). The x-axis shows the $(1-\mathrm{SSCD}_{\mathrm{target}}) \uparrow$ scores. RADS consistently occupies the upper-right region of the frontier, maintaining high utility and semantic alignment while significantly reducing memorization.
Figure 5: [jain2025attraction] produces mitigated samples that closely resemble one another. Generated images using the same initial latent $\mathbf{x}_T$. (a) Generated image without text guidance ($g = 0$). (b–d) Generated images produced using different prompts.
...and 7 more figures
