A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks

Huayu Zheng, Guangzhao Li, Baixuan Zhao, Siqi Luo, Hantao Jiang, Guangtao Zhai, Xiaohong Liu

TL;DR

A$^2$-Edit is a unified inpainting framework for arbitrary object categories that lets users replace any target region with a reference object using only a coarse mask. The paper also proposes a Mask Annealing Training Strategy (MATS) that progressively relaxes mask precision during training, reducing the model's reliance on accurate masks and improving robustness across diverse editing tasks.

Abstract

We propose A$^2$-Edit, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale, multi-category dataset UniEdit-500K, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the Mixture of Transformer module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we propose a Mask Annealing Training Strategy (MATS) that progressively relaxes mask precision during training, reducing the model's reliance on accurate masks and improving robustness across diverse editing tasks. Extensive experiments on benchmarks such as VITON-HD and AnyInsertion demonstrate that A$^2$-Edit consistently outperforms existing approaches across all metrics, providing a new and efficient solution for arbitrary object editing.
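
The abstract describes dynamic expert selection and collaboration among experts but gives no routing details. The sketch below is one common way such routing could look, not the authors' implementation. It assumes a softmax router with top-k gating and toy experts; the names `route`, `router_weights`, and `top_k` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    token: Vector,
    router_weights: List[Vector],
    experts: List[Expert],
    top_k: int = 2,
) -> Vector:
    """Send one token to its top_k experts and blend their outputs.

    Assumption: top-k softmax gating, a generic mixture-of-experts
    pattern. The paper's actual router may differ.
    """
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    probs = softmax(logits)

    # Dynamic expert selection: keep the top_k most probable experts.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Collaboration among experts: renormalised weighted sum of their outputs.
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        gate = probs[i] / norm
        expert_out = experts[i](token)
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output
```

In this toy setting each expert is any function from a feature vector to a feature vector of the same length. In a real model the experts would be Transformer sub-blocks and the router weights would be learned.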

Paper Structure

This paper contains 39 sections, 5 equations, 15 figures, and 3 tables.

Figures (15)

  • Figure 1: Visual Results of A$^2$-Edit: A Unified Framework for Arbitrary-Object Inpainting. Our method supports a wide spectrum of real-world scenarios, delivering high-quality results across diverse object categories. The examples illustrate the generality and robustness of our method in handling various image editing tasks. More results are provided in the supplementary material.
  • Figure 2: Overview of the A$^2$-Edit framework. Our architecture takes a reference image, a target image, and a user mask as core inputs, encodes them into a unified feature space, and feeds the fused features into a Mixture-of-Transformers (MoT) module. The MoT Block (MoTB) denotes an expert-specific Transformer block embedded within the attention and feed-forward layers. The model then dynamically routes features to specialized experts for differentiated modeling. The model is trained via the Mask Annealing Training Strategy (MATS), a three-stage training process. Final outputs are decoded by a VAE decoder for high-fidelity results.
  • Figure 3: An overview of our UniEdit-500K dataset. The central pie chart illustrates the proportional distribution across the eight major categories. The word cloud below visualizes the diversity of object classes within the dataset. Surrounding the center are representative examples from each category (Portraits, Accessories, Plants, Vehicles, Animals, Garments, Furniture, and Architecture), showcasing the data format which includes a reference image, a source image, and their corresponding segmentation masks.
  • Figure 4: Qualitative comparison with existing mask-guided image editing methods. Our method consistently produces more coherent structures, sharper details, and better semantic alignment with the target edit compared to existing methods (MimicBrush [chen2024zero], FLUX.1-Kontext [flux-kontext], ACE++ [mao2025ace++], and InsertAnything [song2025insert]).
  • Figure 5: Ablation Study. Removing the framework, training data, or MATS reduces generation quality and cross-category generalization, while progressively adding each component significantly improves detail fidelity and the model’s ability to handle rough masks.
  • ...and 10 more figures
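
The Figure 2 caption describes MATS as a three-stage process, and the abstract says it progressively relaxes mask precision, but neither specifies how the masks are relaxed. The sketch below is purely illustrative. It assumes the stages are a precise mask, a dilated mask, and the bounding box of the dilated mask; the names `annealed_mask`, `dilate`, and `bounding_box`, and the equal split of training steps, are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 marks the editable region


def dilate(mask: Mask, radius: int) -> Mask:
    """Grow the mask by `radius` pixels using a square neighbourhood."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def bounding_box(mask: Mask) -> Mask:
    """Replace the mask with its axis-aligned bounding box."""
    h, w = len(mask), len(mask[0])
    rows = [y for y in range(h) if any(mask[y])]
    cols = [x for x in range(w) if any(mask[y][x] for y in range(h))]
    out = [[0] * w for _ in range(h)]
    if not rows:
        return out  # empty mask stays empty
    for y in range(rows[0], rows[-1] + 1):
        for x in range(cols[0], cols[-1] + 1):
            out[y][x] = 1
    return out


def annealed_mask(mask: Mask, step: int, total_steps: int) -> Mask:
    """Return a coarser mask as training progresses.

    Assumption: three equal-length stages (precise, dilated, box).
    The paper's actual schedule and relaxation operators may differ.
    """
    stage = min(2, (3 * step) // total_steps)  # integer maths avoids float edge cases
    if stage == 0:
        return [row[:] for row in mask]
    if stage == 1:
        return dilate(mask, radius=1)
    return bounding_box(dilate(mask, radius=1))
```

Each stage yields a superset of the previous stage's mask, so over training the model would see less and less precise outlines of the target region, which is the behaviour the abstract attributes to MATS.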