RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, Yin Zhang

TL;DR

RelationAdapter is a lightweight visual-prompt editing framework for Diffusion Transformers that decouples edit-intent extraction from image generation. It combines a dual-branch adapter with an In-Context Editor that uses position encoding cloning and LoRA fine-tuning, and it generalizes across 218 editing tasks using minimal training samples. The large-scale Relation252K dataset supports evaluation of transfer and adaptation to unseen edits. Experiments show consistent gains in pixel fidelity, semantic similarity, and editing consistency over state-of-the-art baselines, with efficient parameter usage and scalable training.

Abstract

Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract and transfer content-aware editing intent to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT) based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.

Paper Structure

This paper contains 41 sections, 7 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Our framework, RelationAdapter, can effectively perform a variety of image editing tasks by relying on exemplar image pairs and the original image. These tasks include (a) low-level editing, (b) style transfer, (c) image editing, and (d) customized generation.
  • Figure 2: The overall architecture and training paradigm of RelationAdapter. We employ the RelationAdapter to decouple inputs by injecting visual prompt features into the MMAttention module to control the generation process. Meanwhile, a high-rank LoRA is used to train the In-Context Editor on a large-scale dataset. During inference, the In-Context Editor encodes the source image into conditional tokens, concatenates them with noise-added latent tokens, and directs the generation via the MMAttention module.
  • Figure 3: Overview of the four main task categories in our dataset. Each block lists representative sub-tasks (with ellipses indicating more), along with image-pair examples.
  • Figure 4: Overview of the annotation pipeline using GPT-4o. For each image pair, GPT-4o generates a source caption, a target caption, and an edit instruction describing the transformation from $I_{\text{src}}$ to $I_{\text{tar}}$.
  • Figure 5: Compared to baselines, RelationAdapter demonstrates outstanding instruction-following ability, image consistency, and editing effectiveness on both seen and unseen tasks.
  • ...and 9 more figures
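
The inference step described in the Figure 2 caption (source image encoded into conditional tokens, then concatenated with the noise-added latent tokens) can be illustrated with a toy sketch. This is not the authors' implementation: the `Token` class, the `build_sequence` helper, the scalar stand-in for feature vectors, and the noise-first ordering are all hypothetical. It also assumes that "position encoding cloning" means the conditional tokens reuse the position ids of the noisy latent tokens at the same grid location.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    kind: str              # "noise" (noise-added latent) or "cond" (source image)
    pos: Tuple[int, int]   # (row, col) position id on the latent grid
    value: float           # scalar stand-in for a feature vector


def grid_positions(h: int, w: int) -> List[Tuple[int, int]]:
    """Row-major position ids for an h x w latent grid."""
    return [(r, c) for r in range(h) for c in range(w)]


def build_sequence(noisy_latent: List[float],
                   source_latent: List[float],
                   h: int, w: int) -> List[Token]:
    """Concatenate noisy latent tokens with conditional tokens.

    Assumption: the conditional tokens clone the position ids of the
    noisy latent tokens, so attention can align the two pixel-wise.
    The noise-first ordering is an arbitrary choice for this sketch.
    """
    if not (len(noisy_latent) == len(source_latent) == h * w):
        raise ValueError("both latents must contain h * w tokens")
    positions = grid_positions(h, w)
    noise_tokens = [Token("noise", p, v) for p, v in zip(positions, noisy_latent)]
    cond_tokens = [Token("cond", p, v) for p, v in zip(positions, source_latent)]
    return noise_tokens + cond_tokens


if __name__ == "__main__":
    seq = build_sequence([0.1] * 6, [0.5] * 6, h=2, w=3)
    for tok in seq:
        print(tok.kind, tok.pos)
```

In a real DiT the joint sequence would then pass through the attention blocks, and only the noise tokens would be denoised; that part is omitted here.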