
The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

Yigit Ekin, Yossi Gandelsman

Abstract

We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no visible edit) and over-steering (changes to other attributes). Adding scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives and outperforms other training-free methods.
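The steering step described in the abstract can be illustrated with a minimal sketch. It assumes each prompt has already been reduced to a single fixed-length embedding vector; real text encoders typically return per-token embeddings, and how the paper pools or aligns them is not stated here. The function names `steering_vector` and `steer` and the toy numbers are hypothetical, chosen only for illustration.

```python
from statistics import fmean


def steering_vector(positive, negative):
    """Difference of means between two sets of embeddings.

    `positive` and `negative` are lists of equal-length float lists,
    standing in for text-encoder outputs of the contrastive prompts
    (an assumption: one vector per prompt).
    """
    dim = len(positive[0])
    pos_mean = [fmean(v[i] for v in positive) for i in range(dim)]
    neg_mean = [fmean(v[i] for v in negative) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(prompt_embedding, direction, alpha):
    """Add the steering direction, scaled by strength `alpha`, to a prompt embedding."""
    return [e + alpha * d for e, d in zip(prompt_embedding, direction)]


# Toy 3-dimensional embeddings (made-up numbers, not real encoder outputs).
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # e.g. prompts with the attribute
negative = [[0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]  # e.g. matching prompts without it

d_s = steering_vector(positive, negative)

# Sweeping alpha over an interval gives a family of progressively steered embeddings.
steered = [steer([0.5, 0.5, 0.5], d_s, alpha) for alpha in (0.0, 0.5, 1.0)]
```

In this toy case the two prompt sets differ only along the first coordinate, so the steering vector isolates that axis and leaves the others untouched, which is the intent behind using contrastive pairs that differ in a single attribute.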

Paper Structure (41 sections, 8 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Our framework. Given a user text prompt, our method enables controllable editing in text-to-image generation without retraining. (a) In the default setting, the prompt is encoded by the text encoder and used by the generative pipeline to produce an image. (b) To introduce edit control, we derive a steering direction $\mathbf{d}_s$ by computing a difference-of-means vector from contrastive text pairs (e.g., “smiling person” vs. “neutral face person”). (c) We automatically determine the optimal steering strength range via an elastic range search, preventing understeering or oversteering artifacts. (d) Such continuous steering along $\mathbf{d}_s$ allows smooth manipulation of the desired attribute (e.g., increasing smile intensity).
  • Figure 2: Illustration of bias inheritance in steering. When the age direction is computed from a biased dataset (e.g., predominantly old men), the resulting steering vector entangles gender with age. Consequently, age manipulations not only modify apparent age but also introduce unintended gender-related changes, revealing the dataset’s underlying bias.
  • Figure 3: Effect of steering magnitude on edit strength. Weak steering produces minimal visual change, while large magnitudes lead to semantic drift. Our elastic range search algorithm automatically identifies the steering range that achieves perceptually consistent edits, in this case a balanced cartoon stylization.
  • Figure 4: Trade-off between edit strength and fidelity. $\Delta$VQA (edit success) vs. DreamSim (distance to the original). Curves correspond to different methods; numeric labels denote increasing steering strengths. Methods closer to the upper-left corner achieve stronger edits at lower distortion. Training-based methods are shown with dashed lines and training-free methods with solid lines.
  • Figure 5: Qualitative results. Our method enables a diverse range of continuous and disentangled semantic edits across various image styles. The leftmost images, marked with a red border, are the initial generations; edit strength increases from left to right. We demonstrate the ability to add style (photorealism), change the global scene (crowdedness), and change local texture (age). Methods labeled in blue text are training-based.
  • ...and 2 more figures
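
The captions for Figures 1 and 3 describe an elastic range search that avoids both under-steering and over-steering, but this listing does not specify how it works. The sketch below is therefore not the paper's algorithm. It shows one plausible way to find such an interval, assuming a scalar function `change(alpha)` that measures how far the steered output is from the unsteered one and does not decrease as `alpha` grows. The names `change`, `low`, `high`, `bisect_threshold`, and `elastic_range` are all hypothetical.

```python
def bisect_threshold(change, lo, hi, threshold, iters=40):
    """Find the alpha in [lo, hi] where `change` first reaches `threshold`.

    Assumes `change` is non-decreasing and change(hi) >= threshold.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if change(mid) < threshold:
            lo = mid
        else:
            hi = mid
    return hi


def elastic_range(change, low, high, start=1.0, max_doublings=20):
    """Return (alpha_min, alpha_max) such that `change` stays within [low, high].

    Hypothetical procedure: grow the upper bound by doubling until the output
    changes too much, then bisect for the two threshold crossings.
    """
    hi = start
    for _ in range(max_doublings):
        if change(hi) >= high:
            break
        hi *= 2
    else:
        raise ValueError("change never reached the over-steering threshold")
    alpha_min = bisect_threshold(change, 0.0, hi, low)
    alpha_max = bisect_threshold(change, alpha_min, hi, high)
    return alpha_min, alpha_max


def toy_change(alpha):
    """Toy stand-in for a perceptual distance: grows linearly, saturates at 1."""
    return min(1.0, alpha / 10)


alpha_min, alpha_max = elastic_range(toy_change, low=0.1, high=0.6)

# Evenly spaced steering strengths inside the interval found above.
n = 5
strengths = [alpha_min + i * (alpha_max - alpha_min) / (n - 1) for i in range(n)]
```

In practice `change` would require running the generator and comparing images, so each evaluation is expensive; a bisection-style search keeps the number of evaluations logarithmic in the desired precision, and a much smaller `iters` would likely suffice.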