Edit Transfer: Learning Image Editing via Vision In-Context Relations

Lan Chen, Qi Mao, Yuchao Gu, Mike Zheng Shou

TL;DR

Edit Transfer introduces a few-shot editing setting in which a model learns a transformation from a single source–target pair and applies it to a new query image. It formulates visual relation in-context learning on a DiT-based text-to-image backbone (FLUX): the editing example and the query image are arranged into a four-panel composite, and lightweight LoRA fine-tuning captures complex non-rigid edits that text prompts or appearance transfer cannot express. With only 42 training samples, the approach outperforms state-of-the-art text-based (TIE) and reference-based (RIE) image editing baselines in qualitative and quantitative evaluations on diverse non-rigid and compositional edits. This suggests that few-shot visual relation learning can unlock sophisticated geometric edits with minimal data, with potential for broader editing types and cross-species generalization.

Abstract

We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, building upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.
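The four-panel arrangement described in the abstract can be illustrated with a minimal sketch. It assumes all panels are equally sized NumPy arrays; the function name `make_four_panel` and the zero-filled placeholder for the unknown bottom-right panel are illustrative choices, not taken from the paper.

```python
import numpy as np


def make_four_panel(src, tgt, query_src, query_tgt=None):
    """Tile four H x W x C images into one 2H x 2W x C composite.

    Top row: the editing pair (src, tgt).
    Bottom row: the query image and its edited target.
    If query_tgt is None (e.g. at inference, when it is still unknown),
    the bottom-right panel is filled with zeros as a placeholder.
    This layout helper is a hypothetical illustration only.
    """
    if tgt.shape != src.shape or query_src.shape != src.shape:
        raise ValueError("all panels must share the same shape")
    if query_tgt is None:
        query_tgt = np.zeros_like(src)
    elif query_tgt.shape != src.shape:
        raise ValueError("all panels must share the same shape")

    top = np.concatenate([src, tgt], axis=1)            # side by side
    bottom = np.concatenate([query_src, query_tgt], axis=1)
    return np.concatenate([top, bottom], axis=0)        # stack the rows
```

During training all four panels would be known, so `query_tgt` would be supplied; at inference only the first three are available and the bottom-right panel is what the model is asked to generate.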

Paper Structure

This paper contains 20 sections, 3 equations, 18 figures, 1 table.

Figures (18)

  • Figure 1: Edit Transfer aims to learn a transformation from a given source–target editing example, and apply the edit to a query image. Our framework can effectively transfer both (b) single and (c) compositional non-rigid edits via proposed visual relation in-context learning.
  • Figure 2: Comparisons with existing editing paradigms. (a) Existing TIE methods [hertz2022prompt, brooks2023instructpix2pix, cao_2023_masactrl, wang2024taming, avrahami2024stableflow, feng2024dit4editdiffusiontransformerimage] rely solely on text prompts to edit images, making them ineffective for complex non-rigid transformations that are difficult to describe accurately. (b) Existing RIE methods [Gatysstyle, Zhucyclegan, Alaluf2024transfer, zhou2025attentiondistillationunifiedapproach, YangPaintbyExample, ChenSpecRef, he2024freeedit, chen2024mimicbrush, biswas2025PIXELS, ChenAnyDoor] incorporate visual guidance via a reference image but primarily focus on appearance transfer, failing in non-rigid pose modifications. (c) In contrast, our proposed Edit Transfer learns and applies the transformation observed in editing examples to a query image, effectively handling intricate non-rigid edits.
  • Figure 3: Visual relation in-context learning for Edit Transfer. (a) We arrange in-context examples in a four-panel layout: the top row (an editing pair $(\mathcal{I}_s, \mathcal{I}_t)$) and the bottom row (the query pair $(\hat{\mathcal{I}}_s, \hat{\mathcal{I}}_t)$). Our goal is to learn the transformation $\mathcal{I}_s \to \mathcal{I}_t$ and apply it to the bottom-left image $\hat{\mathcal{I}}_s$, producing the target $\hat{\mathcal{I}}_t$ in the bottom-right. (b) We fine-tune a lightweight LoRA in the MMA to better capture visual relations. Noise addition and removal are applied only to $z_t$, while the conditional tokens $c_I$ (derived from $(\mathcal{I}_s, \mathcal{I}_t, \hat{\mathcal{I}}_s)$) remain noise-free. (c) Finally, we cast Edit Transfer as an image generation task by initializing the bottom-right latent token $z_T$ with random noise and concatenating it with the clean tokens $c_I$. Leveraging the enhanced in-context capability of the fine-tuned DiT blocks, the model generates $\hat{\mathcal{I}}_t$, effectively transferring the same edits from the top row to the bottom-left image.
  • Figure 4: Edit Transfer exhibits impressive versatility in transferring the edit shown by a visual exemplar pair to the requested source image, delivering high-quality (a) single-edit transformations as well as (b) effective compositional edits that seamlessly combine multiple modifications.
  • Figure 5: Qualitative comparisons. Our method consistently outperforms TIE and RIE methods across various non-rigid editing tasks. We provide the detailed text prompts used for the TIE methods in Section \ref{subsec:IB}.
  • ...and 13 more figures
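
The Figure 3 caption states that noise is added only to the bottom-right latent tokens while the conditional tokens stay noise-free. The sketch below illustrates that idea on plain NumPy arrays. The linear interpolation toward Gaussian noise is an assumption (a rectified-flow-style schedule), and the function name `noise_target_only` and the token shapes are hypothetical; none of this is confirmed as the paper's implementation.

```python
import numpy as np


def noise_target_only(cond_tokens, target_tokens, t, rng):
    """Noise the target tokens and leave the conditional tokens clean.

    cond_tokens:   (N_c, D) tokens from the three known panels.
    target_tokens: (N_t, D) clean tokens of the bottom-right panel.
    t:             noise level in [0, 1]; 0 = clean, 1 = pure noise.
    rng:           a numpy.random.Generator.

    ASSUMPTION: linear interpolation between data and noise; the
    paper's exact schedule is not given in this section.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    noise = rng.standard_normal(target_tokens.shape)
    z_t = (1.0 - t) * target_tokens + t * noise
    # Only z_t carries noise; the conditional tokens pass through unchanged.
    sequence = np.concatenate([cond_tokens, z_t], axis=0)
    return sequence, noise
```

Setting `t = 1.0` corresponds to the generation setup in panel (c) of the caption, where the bottom-right latent starts as pure random noise and is concatenated with the clean tokens.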