Training-Free Multi-Concept Image Editing

Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki

TL;DR

This work introduces a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept.

Abstract

Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language cannot express. Many visual concepts, such as facial structure, material texture, or object geometry, cannot be conveyed through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept. Our approach enables combining and controlling multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level cues from pretrained concept adapters. We further refine DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Quantitative and qualitative results demonstrate consistent improvements over existing training-free diffusion editing methods on InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.

Paper Structure (20 sections, 13 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Our method enables training-free, concept-based image editing with diffusion models. By unifying an Optimised Delta Denoising Score (DDS) objective with LoRA-driven concept composition, it allows multiple visual concepts that would be impossible to describe adequately in text alone, such as facial structure, clothing, or objects, to be combined and controlled directly within the diffusion process. This approach preserves concept identity and fine-grained details while supporting complex edits, bridging the gap between text-based and visual concept-driven control.
  • Figure 2: Overview of our concept-based image editing framework. We integrate multiple LoRA adapters into the Optimised DDS loop, enabling spatially-aware, concept-preserving edits across multiple concepts. The denoising trajectory follows ordered timesteps, allowing precise control over pose, style, and semantic attributes while maintaining overall composition.
  • Figure 3: Visualisation of timestep-ordered denoising. Comparison between the source image, $\nabla_\theta \mathcal{L}_{\mathrm{OptDDS}}$ at intermediate optimisation steps, and the final target image. Unlike DDS (Hertz et al., ICCV 2023), which samples timesteps uniformly at random, our method enforces a strict descending timestep order, enabling a coarse-to-fine denoising trajectory, as is evident from the visualised gradients. Early steps capture high-frequency structural details such as edges, while later steps refine lower-frequency and stylistic components, resulting in a more coherent and stable edit.
  • Figure 4: Comparison of our Optimised DDS vs baselines. Our method produces better results for the target subject with similarly perceived changes in the background.
  • Figure 5: Comparison of our Optimised DDS with concept composition against previous LoRA-composition methods, evaluated using GPT-4V. Our method achieves higher perceived image and composition quality, with consistently higher pairwise win rates.
  • ...and 8 more figures
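
The contrast drawn in the Figure 3 caption, between timesteps sampled uniformly at random and a strictly descending timestep order, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bounds `t_min=50` and `t_max=950`, and the even spacing of the ordered schedule are all assumptions made for illustration, since this excerpt does not state the actual schedule.

```python
import random


def random_timesteps(num_steps, t_min=50, t_max=950, seed=0):
    """Baseline-style schedule: every optimisation step draws its
    timestep uniformly at random. Bounds are illustrative assumptions."""
    rng = random.Random(seed)
    return [rng.randint(t_min, t_max) for _ in range(num_steps)]


def ordered_timesteps(num_steps, t_min=50, t_max=950):
    """Strictly descending schedule from t_max down to t_min.
    Even spacing is an assumption; the excerpt only says the order
    is strictly descending."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps > t_max - t_min + 1:
        raise ValueError("too many steps for a strictly descending integer schedule")
    if num_steps == 1:
        return [t_max]
    spacing = (t_max - t_min) / (num_steps - 1)
    return [round(t_max - i * spacing) for i in range(num_steps)]


if __name__ == "__main__":
    print("random :", random_timesteps(10))
    print("ordered:", ordered_timesteps(10))
```

In an optimisation loop, step `i` would use the `i`-th entry of the chosen schedule as its diffusion timestep. With the ordered schedule, early updates see heavily noised latents and later updates see lightly noised ones, whereas the random schedule mixes noise levels throughout.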