SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing

Songyan Chen, Jiancheng Huang

TL;DR

This work defines Specific Reference Condition Real Image Editing, enabling edits guided not only by a source image and prompts but also by a reference image. It proposes SpecRef, a fast, training-free baseline with a two-stage pipeline (inversion and editing) that extracts reference features via DDIM inversion and incorporates them through a Specific Reference Attention Layer with masking to control where reference content is injected. The key contributions are the task formulation, reference feature extraction from $I_2$, the SR-attn mechanism that blends reference and editing regions using masks, and extensive ablations showing improved control over content replacement with a reference while preserving non-edited areas. The approach offers a practical, controllable baseline for real-image editing with potential impact on editing workflows, though it has limitations in spatially mismatched cases and motivates future robust, learned methods.
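The inversion stage mentioned above relies on DDIM inversion, which deterministically maps a clean latent to a noisy one so that it can later be reconstructed or edited. The following is a minimal sketch of the standard DDIM inversion update, not SpecRef's actual code: `ddim_invert`, `alphas_cumprod`, and `eps_model` are illustrative names, and `eps_model` is a hypothetical stand-in for the diffusion U-Net's noise predictor.

```python
import numpy as np

def ddim_invert(x0, alphas_cumprod, eps_model):
    """Deterministic DDIM inversion from a clean latent to a noisy latent.

    alphas_cumprod[t] is the cumulative signal level at step t (decreasing in t).
    eps_model(x, t) is a hypothetical noise predictor (assumption, not from the paper).
    Returns the list of latents [x_0, x_1, ..., x_T].
    """
    x = x0
    trajectory = [x0]
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)
        # Clean latent implied by the current noise estimate.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Re-noise that prediction to the next (noisier) level.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
        trajectory.append(x)
    return trajectory
```

In a pipeline like the one described, such a routine would presumably be run on both the source and the reference image latents, with intermediate features of the reference pass cached for the editing stage; that caching is not shown here.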

Abstract

Text-conditional image editing based on large diffusion generative models has attracted the attention of both industry and the research community. Most existing methods perform non-reference editing, in which the user can provide only a source image and a text prompt. This restricts the user's control over the characteristics of the editing outcome. To increase user freedom, we propose a new task called Specific Reference Condition Real Image Editing, which allows the user to provide a reference image to further control the outcome, such as replacing an object with a particular one. To accomplish this, we propose a fast baseline method named SpecRef. Specifically, we design a Specific Reference Attention Controller to incorporate features from the reference image, and adopt a mask mechanism to prevent interference between editing and non-editing regions. We evaluate SpecRef on typical editing tasks and show that it can achieve satisfactory performance. The source code is available at https://github.com/jingjiqinggong/specp2p.
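To make the idea of mask-controlled reference injection concrete, here is one possible sketch of an attention layer in which only positions inside an edit mask may attend to reference tokens. This is an assumption about how such a mechanism could look, not the paper's SR-attn formulation, which is not given in this summary; the function name, argument names, and the concatenate-then-mask design are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for blocked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_reference_attention(q, k_self, v_self, k_ref, v_ref, edit_mask):
    """Attention over self tokens plus reference tokens, gated by a mask.

    q, k_self, v_self: (N, d) arrays from the editing path.
    k_ref, v_ref:      (M, d) arrays cached from the reference image.
    edit_mask:         (N,) boolean array, True where reference content
                       may be injected (hypothetical convention).
    Returns (output, weights) with shapes (N, d) and (N, N + M).
    """
    d = q.shape[-1]
    n_self = k_self.shape[0]
    k = np.concatenate([k_self, k_ref], axis=0)
    v = np.concatenate([v_self, v_ref], axis=0)
    logits = q @ k.T / np.sqrt(d)
    # Queries outside the edit region are blocked from all reference columns.
    is_ref_col = np.arange(k.shape[0])[None, :] >= n_self
    blocked = (~edit_mask)[:, None] & is_ref_col
    logits = np.where(blocked, -np.inf, logits)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights
```

Under this design, rows outside the mask reduce to ordinary self-attention, so non-edited regions are unaffected by the reference features, which matches the stated goal of preventing interference between editing and non-editing regions.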

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, and 1 algorithm.

Figures (6)

  • Figure 1: A demonstration of the existing non-reference editing task and our new task. The top row is the existing task and the bottom row is our specific reference condition image editing task.
  • Figure 2: The pipeline of SpecRef, consisting of two stages: an inversion stage and an editing stage. During the inversion stage, we perform inversion on both the source image $I_1$ and the reference image $I_2$ to obtain the noisy latents and reference features. Then, in the editing stage, there are two paths: a reconstruction path for reconstructing $I_1$ and an editing path for generating the edited result.
  • Figure 3: The proposed Specific Reference Attention Layer (SR-attn).
  • Figure 4: The experimental results. Our SpecRef can solve the problem whereby non-reference editing (p2p) fails to generate content for certain words, and can replace the object with the specific reference.
  • Figure 5: The experimental results. Our SpecRef can solve the problem whereby non-reference editing (p2p) fails to generate content for certain words, and can replace the object with the specific reference.
  • ...and 1 more figure