LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair

Xue Song, Jiequan Cui, Hanwang Zhang, Jiaxin Shi, Jingjing Chen, Chi Zhang, Yu-Gang Jiang

TL;DR

LoC tackles the problem of ambiguous text prompts in image editing by using before-after visual instructions to capture user intent. It introduces a dynamic LoRA generation mechanism (LoC) via a hypernetwork that encodes the change between a before and after image and injects it into a frozen editing model, along with LoRA Reverse to regularize learning from paired data. The method demonstrates broad support for editing types and yields high-quality results on SEED-Data-Edit and MagicBrush with real-time inference. This work offers interpretable, reusable instruction-specific LoRAs for real-world visual editing while acknowledging potential misuse and the need for safeguards.

Abstract

In this paper, we propose the LoRA of Change (LoC) framework for image editing with visual instructions, i.e., before-after image pairs. Compared to the ambiguities, insufficient specificity, and diverse interpretations of natural language, visual instructions can accurately reflect users' intent. Building on the success of LoRA in text-based image editing and generation, we dynamically learn an instruction-specific LoRA to encode the "change" in a before-after image pair, enhancing the interpretability and reusability of our model. Furthermore, generalizable models for image editing with visual instructions typically require quad data, i.e., a before-after image pair, along with query and target images. Due to the scarcity of such quad data, existing models are limited to a narrow range of visual instructions. To overcome this limitation, we introduce the LoRA Reverse optimization technique, enabling large-scale training with paired data alone. Extensive qualitative and quantitative experiments demonstrate that our model produces high-quality images that align with user intent and support a broad spectrum of real-world visual instructions.
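
To make the idea of injecting an instruction-specific LoRA into a frozen model concrete, the following is a minimal sketch of a single linear layer with a low-rank update. It is an illustration under stated assumptions, not the authors' implementation: the names `lora_linear`, `lora_down`, `lora_up`, and `scale` are hypothetical, and the random matrices merely stand in for the factors that a hypernetwork would generate from a before-after image pair.

```python
import numpy as np


def lora_linear(x, w_frozen, lora_down, lora_up, scale=1.0):
    """Frozen linear layer plus a low-rank (LoRA) weight update.

    x:         (batch, d_in) inputs
    w_frozen:  (d_out, d_in) frozen base weight, never modified
    lora_down: (rank, d_in)  low-rank factor (hypothetical name)
    lora_up:   (d_out, rank) low-rank factor (hypothetical name)
    scale:     strength of the update (hypothetical parameter)
    """
    # The update has rank at most `rank`, so it is cheap to store and reuse.
    delta_w = lora_up @ lora_down
    return x @ (w_frozen + scale * delta_w).T


rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 4, 2

w_frozen = rng.normal(size=(d_out, d_in))
# Stand-ins for hypernetwork outputs (assumption: random values for illustration).
lora_down = rng.normal(size=(rank, d_in))
lora_up = rng.normal(size=(d_out, rank))

x = rng.normal(size=(3, d_in))
y_base = lora_linear(x, w_frozen, lora_down, lora_up, scale=0.0)  # base model only
y_edit = lora_linear(x, w_frozen, lora_down, lora_up, scale=1.0)  # with the LoRA
```

Because the base weight is left untouched, the same pair of low-rank factors can be stored and reapplied to new query images, which is one way to read the reusability claim in the abstract.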

Paper Structure

This paper contains 15 sections, 9 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Image editing with before-after image pair instructions.
  • Figure 2: Information leakage without LoRA Reverse training.
  • Figure 3: Qualitative comparison between LoC and two state-of-the-art methods across the 6 editing types.
  • Figure 4: Overview of the LoRA of Change (LoC) framework. (a) The hypernetwork $\mathcal{H}$ generates the instruction-specific LoRA from the before-after image pair $\langle A, A^{'} \rangle$. (b) LoRA Reverse training. With the generated LoRA $\Delta$, the red arrows $\to$ indicate that the model is trained to reconstruct $B^{'}$ taking $B$ as the spatial condition, while the blue arrows $\to$ indicate that the model is trained to reconstruct $B$ taking $B^{'}$ as the spatial condition. (c) Inference for image editing. $\mathcal{L}$ is the image reconstruction loss.
  • Figure 5: Hypernetwork $\mathcal{H}$ for LoRA generation. The transformer decoder $\mathcal{D}$ consists of $M=6$ blocks.
  • ...and 7 more figures
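
The two reconstruction directions described in the Figure 4 caption can be sketched as a toy objective. This is only a sketch under loud assumptions: `toy_editor` is a hypothetical linear stand-in for the editing model, summing the two losses is an assumption, and the caption does not say how the LoRA is used in the reverse direction, so the reverse update is passed in as a separate, hypothetical argument `delta_rev`.

```python
import numpy as np


def mse(a, b):
    """Mean squared error, used here as a stand-in for the reconstruction loss L."""
    return float(np.mean((a - b) ** 2))


def toy_editor(cond, w_frozen, delta_w):
    """Hypothetical linear 'editing model': frozen weight plus a weight update."""
    return cond @ (w_frozen + delta_w).T


def bidirectional_loss(b, b_prime, w_frozen, delta_fwd, delta_rev):
    """Sum of both reconstruction directions (the summation is an assumption).

    Forward: reconstruct B' taking B as the condition.
    Reverse: reconstruct B taking B' as the condition.
    """
    loss_fwd = mse(toy_editor(b, w_frozen, delta_fwd), b_prime)
    loss_rev = mse(toy_editor(b_prime, w_frozen, delta_rev), b)
    return loss_fwd + loss_rev


# Toy data: the "edit" doubles every value of the flattened image.
d = 4
b = np.arange(1.0, 1.0 + 2 * d).reshape(2, d)
b_prime = 2.0 * b
w_frozen = np.eye(d)

# Hand-picked updates that solve both directions exactly in this toy setting.
delta_fwd = np.eye(d)         # identity + identity doubles the input
delta_rev = -0.5 * np.eye(d)  # identity - 0.5 * identity halves the input

loss_both = bidirectional_loss(b, b_prime, w_frozen, delta_fwd, delta_rev)
loss_no_reverse = bidirectional_loss(b, b_prime, w_frozen, delta_fwd, np.zeros((d, d)))
```

In this toy setting the reverse term is zero only when the reverse update actually undoes the edit, which illustrates why training in both directions constrains what the generated LoRA can encode.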