InstructBrush: Learning Attention-based Instruction Optimization for Image Editing
Ruoyu Zhao, Qingnan Fan, Fei Kou, Shuai Qin, Hong Gu, Wei Wu, Pengcheng Xu, Mingrui Zhu, Nannan Wang, Xinbo Gao
TL;DR
InstructBrush addresses the gap where language alone cannot express certain image edits by learning editing instructions from exemplar image pairs and applying them to new images. It introduces Attention-based Instruction Optimization to directly optimize cross-attention features and Transformation-oriented Instruction Initialization to encode editing priors via unique transformation phrases, supported by the TOP-Bench benchmark for open-scenario evaluation. Empirical results show consistent improvements in editing quality and semantic alignment over prior methods, with notable gains in local editing tasks due to reduced content leakage. The work provides a practical framework and a benchmarking suite to advance instruction inversion for robust, exemplar-guided image editing.
Abstract
In recent years, instruction-based image editing methods have garnered significant attention. However, despite encompassing a wide range of editing priors, these methods struggle with editing tasks that are difficult to describe accurately through language. To bridge this gap, we propose InstructBrush, an inversion method for instruction-based image editing methods. It extracts editing effects from exemplar image pairs as editing instructions, which are then applied for image editing. InstructBrush introduces two key techniques, Attention-based Instruction Optimization and Transformation-oriented Instruction Initialization, to address the limitations of the previous method in terms of inversion effects and instruction generalization. To explore the ability of instruction inversion methods to guide image editing in open scenarios, we establish the Transformation-Oriented Paired Benchmark (TOP-Bench), which contains a rich set of scenes and editing types. This benchmark paves the way for further exploration of instruction inversion. Quantitatively and qualitatively, our approach achieves superior editing performance and is more semantically consistent with the target editing effects.
