SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control

Arman Zarei, Samyadeep Basu, Mobina Pournemat, Sayan Nag, Ryan Rossi, Soheil Feizi

TL;DR

SliderEdit tackles the discreteness of instruction-based image editing by enabling continuous, per-instruction control through lightweight adapters trained with a Partial Prompt Suppression loss. By embedding and selectively modulating target instruction tokens within the MM-DiT framework, SliderEdit delivers smooth, disentangled edit trajectories via STLoRA and GSTLoRA, allowing a single framework to handle multi-instruction prompts and zero-shot personalization without per-instruction retraining. The approach integrates with state-of-the-art editors such as FLUX-Kontext and Qwen-Image-Edit, improving continuity, semantic consistency, and user steerability across local and global edits. Overall, SliderEdit paves the way for interactive, instruction-driven image manipulation with continuous, compositional control on real images.

Abstract

Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user's ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
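As a rough illustration of the idea of one shared set of low-rank matrices exposed as a per-instruction slider, the sketch below adds a scaled low-rank update only to the token embeddings that belong to a chosen instruction. It is a toy under stated assumptions, not the authors' implementation: the function name `apply_slider_adapter`, the `down`/`up` matrices, the boolean `target_mask`, and the way `strength` scales the update are all hypothetical, and how the slider value maps to edit intensity in the actual method is not specified here.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def apply_slider_adapter(
    tokens: List[Vector],
    target_mask: Sequence[bool],
    down: Matrix,  # hypothetical r x d down-projection
    up: Matrix,    # hypothetical d x r up-projection
    strength: float,
) -> List[Vector]:
    """Return h + strength * up @ (down @ h) for masked tokens.

    Tokens outside the mask (other instructions, padding) are copied
    unchanged, so each instruction can be given its own strength.
    """
    out: List[Vector] = []
    for h, is_target in zip(tokens, target_mask):
        if is_target:
            delta = matvec(up, matvec(down, h))
            out.append([x + strength * d for x, d in zip(h, delta)])
        else:
            out.append(list(h))
    return out


# Toy example: two 2-d tokens, rank-1 adapter, only the second token is targeted.
tokens = [[1.0, 2.0], [3.0, 4.0]]
mask = [False, True]
down = [[1.0, 0.0]]     # r=1, d=2
up = [[-1.0], [0.0]]    # d=2, r=1

half = apply_slider_adapter(tokens, mask, down, up, strength=0.5)
zero = apply_slider_adapter(tokens, mask, down, up, strength=0.0)
```

With `strength=0.0` the embeddings are left untouched, and intermediate values move only the targeted tokens, which is the sense in which a single adapter can serve as a continuous slider for one instruction in a multi-instruction prompt.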

Paper Structure

This paper contains 28 sections, 11 equations, 17 figures, 2 tables, 2 algorithms.

Figures (17)

  • Figure 1: SliderEdit produces continuous edit trajectories in state-of-the-art instruction-based image editing models. Our method provides fine-grained and disentangled control over the intensity of edit attributes described in an instruction, allowing continuous transitions between editing strengths. Despite its effectiveness, SliderEdit is extremely lightweight and can be trained efficiently to transform a state-of-the-art instruction-based image editing model into a continuously controllable editing framework.
  • Figure 2: Instruction-token embedding interpolation for strength control. Interpolating between instruction and null-token embeddings produces intermediate edit strengths, demonstrating the potential for achieving fine-grained control through direct manipulation of intermediate instruction embeddings.
  • Figure 3: Overview of the SliderEdit training pipeline. Learnable low-rank matrices are applied to the intermediate token embeddings corresponding to the target edit instruction. These adapters are trained using the Partial Prompt Suppression (PPS) loss, which encourages the model to suppress or neutralize the visual effect of the selected instruction tokens.
  • Figure 4: Qualitative samples of GSTLoRA, demonstrating smooth, continuous control over the strength of both local and global edits.
  • Figure 5: Controllable zero-shot multi-subject personalization with STLoRA. STLoRA enables smooth adjustment of each instruction’s strength to generate coherent, evolving image sequences, supporting story-like visual editing. (Best viewed from top-left to top-right, then bottom-right to bottom-left)
  • ...and 12 more figures
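
The embedding interpolation described in the Figure 2 caption can be sketched as a linear blend between each instruction-token embedding and a null-token embedding. This is a minimal illustration under assumptions: the function name, the use of a single shared `null_emb` vector, the boolean `target_mask`, and the convention that `t = 1` keeps the instruction while `t = 0` replaces it with the null embedding are all hypothetical choices, not details taken from the paper.

```python
from typing import List, Sequence


def interpolate_instruction_tokens(
    instruction_emb: Sequence[Sequence[float]],
    null_emb: Sequence[float],
    target_mask: Sequence[bool],
    t: float,
) -> List[List[float]]:
    """Blend masked token embeddings toward a null-token embedding.

    t = 1.0 keeps the original instruction embedding (assumed full strength);
    t = 0.0 replaces it with the null embedding (assumed no edit).
    Tokens outside the mask are copied unchanged.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    blended: List[List[float]] = []
    for emb, is_target in zip(instruction_emb, target_mask):
        if is_target:
            blended.append([(1.0 - t) * n + t * e for e, n in zip(emb, null_emb)])
        else:
            blended.append(list(emb))
    return blended


# Toy example: only the first token belongs to the instruction being adjusted.
emb = [[2.0, 4.0], [6.0, 8.0]]
null = [0.0, 0.0]
mask = [True, False]

mid_strength = interpolate_instruction_tokens(emb, null, mask, t=0.5)
no_edit = interpolate_instruction_tokens(emb, null, mask, t=0.0)
```

Sweeping `t` from 0 to 1 yields a sequence of conditioning embeddings that would be fed to the editor in place of the original ones. The learned adapters in Figure 3 are described as acting on these same instruction tokens through low-rank matrices rather than through a fixed blend.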