InstructBrush: Learning Attention-based Instruction Optimization for Image Editing

Ruoyu Zhao, Qingnan Fan, Fei Kou, Shuai Qin, Hong Gu, Wei Wu, Pengcheng Xu, Mingrui Zhu, Nannan Wang, Xinbo Gao

TL;DR

InstructBrush addresses the gap where language alone cannot express certain image edits by learning editing instructions from exemplar image pairs and applying them to new images. It introduces Attention-based Instruction Optimization to directly optimize cross-attention features and Transformation-oriented Instruction Initialization to encode editing priors via unique transformation phrases, supported by the TOP-Bench benchmark for open-scenario evaluation. Empirical results show consistent improvements in editing quality and semantic alignment over prior methods, with notable gains in local editing tasks due to reduced content leakage. The work provides a practical framework and a benchmarking suite to advance instruction inversion for robust, exemplar-guided image editing.

Abstract

In recent years, instruction-based image editing methods have garnered significant attention. However, despite encompassing a wide range of editing priors, these methods struggle with editing tasks that are difficult to describe accurately through language. We propose InstructBrush, an inversion method for instruction-based image editing methods, to bridge this gap. It extracts editing effects from exemplar image pairs as editing instructions, which are then applied to edit new images. InstructBrush introduces two key techniques, Attention-based Instruction Optimization and Transformation-oriented Instruction Initialization, to address the limitations of the previous method in terms of inversion effects and instruction generalization. To explore the ability of instruction inversion methods to guide image editing in open scenarios, we establish a Transformation-Oriented Paired Benchmark (TOP-Bench), which contains a rich set of scenes and editing types. The creation of this benchmark paves the way for further exploration of instruction inversion. Quantitatively and qualitatively, our approach achieves superior editing performance and is more semantically consistent with the target editing effects.

Paper Structure (18 sections, 8 equations, 13 figures, 4 tables)

This paper contains 18 sections, 8 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: InstructBrush: an inversion method for instruction-based image editing methods. It extracts editing effects from a few reference image pairs as editing instructions, which are then applied to edit new images. InstructBrush achieves superior editing performance and is more semantically consistent with the target editing effects.
  • Figure 2: The Framework of InstructBrush. InstructBrush inverts instructions from exemplar image pairs by proposing novel Transformation-oriented Instruction Initialization (a) and Attention-based Instruction Optimization (b) modules. The former is proposed to initialize the instruction, which effectively introduces the editing-related prior to facilitate semantic alignment of the instruction with the exemplar image pairs. The latter introduces the editing instruction into the cross-attention layers of the instruction-based image editing model and directly optimizes the Keys and Values corresponding to the instruction within these layers. After optimization, the learned instructions are used to guide the editing of new images (c).
  • Figure 3: Visualization of Applying Time-aware Instructions to Various Denoising Steps. Example: $T = 800$ means that our time-aware instruction is applied during denoising steps 1000 to 800, while the None instruction is applied during the remaining steps (800 to 0). Therefore, $T = 1000$ corresponds to the input image, and $T = 0$ corresponds to our full implementation. The visualization results show that in the early denoising stages, the editing focuses on coarse information such as colors (rows 2 and 3); in the later stages, the editing focuses on detailed information such as textures and facial expressions (rows 1 and 3).
  • Figure 4: Qualitative Comparisons with Existing Methods. Our method achieves superior performance in both local and global image editing. It effectively avoids introducing editing-irrelevant information from the training images, showing better instruction generalization.
  • Figure 5: More Visualization Results of Our Method. Our method demonstrates robust performance on both local and global editing. It also does not introduce scene information from the training image when editing new images, which reflects the instruction generalization of our method.
  • ...and 8 more figures
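
The threshold scheme described in the Figure 3 caption can be sketched as a simple per-step selection rule. This is only an illustration of the caption's wording, not the authors' code: the function names, the string labels, and the 1000-step schedule constant are assumptions made here, and the caption does not say which instruction is used exactly at step $T$ (this sketch assigns the boundary step to the None instruction).

```python
from typing import List

# Assumed schedule length, taken from the "steps 1000 to 0" wording in the caption.
MAX_TIMESTEP = 1000


def instruction_for_step(t: int, threshold: int) -> str:
    """Pick the instruction used at denoising step t (hypothetical helper).

    Steps above the threshold T use the learned, time-aware instruction;
    steps at or below T use the None instruction. Assigning step T itself
    to the None instruction is an assumption of this sketch.
    """
    if not 0 <= threshold <= MAX_TIMESTEP:
        raise ValueError("threshold must lie in [0, MAX_TIMESTEP]")
    return "learned" if t > threshold else "none"


def schedule(threshold: int, timesteps: List[int]) -> List[str]:
    """Label every step of a descending timestep list."""
    return [instruction_for_step(t, threshold) for t in timesteps]


# A coarse five-step schedule, descending from 1000 as in diffusion sampling.
timesteps = list(range(1000, 0, -200))  # [1000, 800, 600, 400, 200]

for T in (1000, 800, 0):
    print(T, schedule(T, timesteps))
```

With $T = 1000$ no step uses the learned instruction, which matches the caption's statement that this setting reproduces the input image; with $T = 0$ every step uses it, matching the full implementation.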