SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder

Ronen Kamenetsky, Sara Dorfman, Daniel Garibi, Roni Paiss, Or Patashnik, Daniel Cohen-Or

TL;DR

SAEdit addresses the challenge of disentangled and continuous image editing by performing token-level manipulation of text embeddings through a Sparse Autoencoder (SAE). It derives sparse, attribute-specific edit directions by comparing the sparse representations of source and target prompts, then applies these directions to individual tokens with a controllable scale factor $\omega$ while keeping the diffusion renderer unchanged. The method is model-agnostic, enabling application across backbones like Flux and Stable Diffusion, and introduces an exponential injection schedule $\omega_t = \min\left(e^{t \cdot \omega} - 1, \tau\right)$ to preserve global structure during editing. Extensive experiments, including quantitative benchmarks and real-image editing via inversion, demonstrate strong identity preservation, high prompt fidelity, and robust, continuous control across diverse attributes and domains.
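As a rough illustration of the injection schedule $\omega_t = \min\left(e^{t \cdot \omega} - 1, \tau\right)$, the sketch below evaluates the formula for a few values of $t$. The function name `injection_scale` and the sample values are hypothetical. This summary does not state the range or direction of $t$, so treating it as a value in $[0, 1]$ is an assumption made only for the example.

```python
import math


def injection_scale(t: float, omega: float, tau: float) -> float:
    """Evaluate omega_t = min(exp(t * omega) - 1, tau).

    t     : step variable (assumed here to lie in [0, 1]; not specified in the summary)
    omega : user-chosen scale factor controlling edit strength
    tau   : upper bound that caps the injected scale
    """
    return min(math.exp(t * omega) - 1.0, tau)


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the shape of the curve.
    omega, tau = 3.0, 5.0
    for step in range(5):
        t = step / 4
        print(f"t={t:.2f}  omega_t={injection_scale(t, omega, tau):.3f}")
```

Under this reading the scale is zero at $t = 0$, grows exponentially with $t$, and is clipped at $\tau$, so a larger $\omega$ reaches the cap sooner.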

Abstract

Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains.

Paper Structure

This paper contains 33 sections, 7 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: We train a Sparse AutoEncoder (SAE) to lift the text embeddings into a higher-dimensional space, where we identify disentangled semantic directions (e.g. for laughing). These directions can then be applied to specific tokens within the input of a text-to-image model to facilitate continuous image editing. As shown on the right, our token-level editing steers the model to incorporate the relevant attribute (laughing) into the subject in the image that corresponds to the chosen token (e.g., “woman” or “kid”), while allowing the attribute’s intensity to be continuously adjusted through a scale factor, $\omega_t$.
  • Figure 2: Naïvely applying a T5 edit direction (top), obtained by interpolating toward the T5 embedding of the target edit, introduces entangled changes that may distort the scene. This can appear as an insufficient edit (left example) or as the modification of unwanted elements (right example). In contrast, edit directions found by the SAE (bottom) yield disentangled edits that preserve identity and achieve the intended modification.
  • Figure 3: We train the Sparse Autoencoder on token embeddings obtained from a frozen T5 encoder, using reconstruction and sparsity losses.
  • Figure 4: Extracting Edit Directions. We derive an edit direction from a prompt pair that isolates a single attribute. Both prompts are encoded with the SAE, and their token representations are aggregated via max-pooling. By comparing the two resulting sparse vectors, we identify the key features corresponding to the desired change. The edit direction is a sparse vector composed of only these key features, taken from the target prompt's representation.
  • Figure 5: Applying the edit direction. An aggregated edit direction is scaled to adjust edit magnitude and applied to the sparse representation of the relevant source token (e.g., man). The result is then decoded back into the T5 embedding space, and used to condition the text-to-image model.
  • ...and 17 more figures
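
The steps described in the captions of Figures 4 and 5 can be sketched on toy data as follows. This is a minimal illustration, not the authors' implementation: the function names, the top-k selection rule used to pick the "key features", the additive update, and the linear decoder are all assumptions, since the captions only say that the two pooled sparse vectors are compared and that the scaled direction is applied to the source token's sparse representation before decoding.

```python
from typing import List

Vector = List[float]


def max_pool(token_codes: List[Vector]) -> Vector:
    """Aggregate per-token sparse codes into one vector by feature-wise max."""
    return [max(column) for column in zip(*token_codes)]


def extract_edit_direction(
    source_codes: List[Vector], target_codes: List[Vector], k: int
) -> Vector:
    """Build a sparse edit direction from a source/target prompt pair.

    Assumption: the key features are the k features whose pooled activation
    increases most from source to target. The direction keeps the target's
    pooled values at those features and is zero elsewhere.
    """
    src = max_pool(source_codes)
    tgt = max_pool(target_codes)
    gains = [t - s for s, t in zip(src, tgt)]
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    keep = {i for i in ranked[:k] if gains[i] > 0}
    return [tgt[i] if i in keep else 0.0 for i in range(len(tgt))]


def apply_edit(token_code: Vector, direction: Vector, scale: float) -> Vector:
    """Add the scaled direction to one token's sparse code (assumed additive)."""
    return [z + scale * d for z, d in zip(token_code, direction)]


def decode(code: Vector, decoder_rows: List[Vector]) -> Vector:
    """Toy linear decoder: embedding = sum_i code[i] * decoder_rows[i]."""
    dim = len(decoder_rows[0])
    return [sum(c * row[j] for c, row in zip(code, decoder_rows)) for j in range(dim)]


if __name__ == "__main__":
    # Hypothetical sparse codes: 2 tokens per prompt, 4 SAE features.
    source_codes = [[0.0, 1.0, 0.0, 0.2], [0.5, 0.0, 0.0, 0.0]]
    target_codes = [[0.0, 1.0, 0.9, 0.2], [0.5, 0.0, 0.1, 0.0]]
    direction = extract_edit_direction(source_codes, target_codes, k=1)
    edited = apply_edit(source_codes[0], direction, scale=0.5)
    # Hypothetical decoder mapping 4 features to a 2-dimensional embedding.
    decoder_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    print(direction, edited, decode(edited, decoder_rows))
```

In this toy setup only the feature that differs between the two prompts survives in the direction, so scaling it changes that one feature of the chosen token and leaves the other tokens untouched. The edited embedding would then replace the original token embedding in the conditioning passed to the text-to-image model.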