X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning

Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu

TL;DR

X2Edit introduces a dataset of 3.7M samples spanning 14 arbitrary-instruction editing tasks, together with a lightweight, plug-and-play editing model built on task-aware MoE-LoRA. The approach couples a diffusion-based editing backbone with a task-embedding MoE and a task-aware contrastive loss that structures the hidden space. It achieves competitive results on multiple benchmarks and integrates directly with FLUX.1. The work improves open-source data quality and model efficiency for flexible image editing, with practical impact on community editing workflows and cross-lingual capabilities.

Abstract

Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, and a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We use industry-leading unified image generation models and expert models to construct the data. We also design reasonable editing instructions with a VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality samples with balanced categories. Second, to integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we use the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model's editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: https://github.com/OPPO-Mente-Lab/X2Edit.
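
To make the contrastive-learning idea in the abstract concrete, the following is a minimal sketch of a loss in which samples sharing an editing type are treated as positives and all other samples in the batch as negatives. It assumes a supervised-contrastive (SupCon-style) formulation over pooled hidden features; the function name `task_contrastive_loss`, the temperature value, and the use of cosine similarity are illustrative assumptions, not the formulation confirmed by the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def task_contrastive_loss(features, task_ids, temperature=0.1):
    """SupCon-style loss (an assumed stand-in for the paper's objective).

    features: list of per-sample feature vectors (e.g. pooled hidden states).
    task_ids: list of editing-type labels, one per sample.
    Samples with the same label are positives; all others are negatives.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and task_ids[j] == task_ids[i]]
        if not positives:
            continue  # this anchor has no same-task partner in the batch
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability assigned to the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


if __name__ == "__main__":
    # Toy batch: two feature clusters, labelled consistently and inconsistently.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(task_contrastive_loss(feats, [0, 0, 1, 1]))
    print(task_contrastive_loss(feats, [0, 1, 0, 1]))
```

In this toy batch the loss is lower when the labels agree with the feature clusters, which is the behavior a task-aware objective would reward: features of the same editing type are pulled together and features of different types are pushed apart.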

Paper Structure

This paper contains 35 sections, 7 equations, 20 figures, 10 tables.

Figures (20)

  • Figure 1: The X2Edit image generation results span 14 diverse editing types. In each unit, the left image serves as the reference. The central modality in the top-right unit is the input to X2I and can be leveraged by other modalities to assist in image editing.
  • Figure 2: The comprehensive construction pipeline of X2Edit Dataset. We divide the pipeline into four stages: (1) Sampling from real-world datasets and synthesizing source images using our internal query dataset; (2) Generating diverse editing instructions using a VLM based on the source images; (3) Generating edited images using task-specific workflows according to the editing instructions; (4) Conducting comprehensive evaluation and filtering of all generated data to ensure quality.
  • Figure 3: X2Edit Dataset Collection Distribution.
  • Figure 4: X2Edit consists of an MLLM for editing instruction understanding, a DiT fine-tuned based on FLUX.1, an optional intent perception model, and task embeddings. We introduce a task-aware MoE-LoRA structure and task-aware contrastive learning into the DiT to enhance the unified editing model's ability to perceive different editing tasks.
  • Figure 5: PQ score, SC score and overall VIEScore evaluated by GPT-4o on GEdit-Bench++.
  • ...and 15 more figures
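
The task-aware MoE-LoRA structure named in the Figure 4 caption can be illustrated with a small sketch: a frozen base projection plus several low-rank (LoRA) experts whose outputs are mixed by a gate driven by a task embedding. Everything here is an assumption made for illustration: the class name `TaskAwareMoELoRA`, the dense softmax gating on the task embedding alone, the zero initialization of the up-projections, and the toy dimensions. The actual routing and placement inside the DiT may differ.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class TaskAwareMoELoRA:
    """Frozen base weight plus a task-gated mixture of LoRA experts (sketch)."""

    def __init__(self, d_in, d_out, rank, num_experts, task_dim, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.base = rand(d_out, d_in)  # stand-in for a frozen pretrained weight
        # One low-rank pair per expert: down-projection A and up-projection B.
        self.down = [rand(rank, d_in) for _ in range(num_experts)]
        # B starts at zero so the layer initially matches the base model.
        self.up = [
            [[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)
        ]
        self.gate = rand(num_experts, task_dim)  # maps task embedding to logits

    def forward(self, x, task_embedding):
        """Return (output, expert_weights) for one input vector."""
        weights = softmax(matvec(self.gate, task_embedding))
        out = matvec(self.base, x)
        for w, a, b in zip(weights, self.down, self.up):
            delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
            out = [o + w * d for o, d in zip(out, delta)]
        return out, weights


if __name__ == "__main__":
    layer = TaskAwareMoELoRA(d_in=4, d_out=3, rank=2, num_experts=3, task_dim=5)
    y, gate_weights = layer.forward([1.0, 2.0, 3.0, 4.0], [0.1] * 5)
    print(y, gate_weights)
```

Only the expert matrices and the gate would be trained in such a setup, which is how a LoRA-based design keeps the trainable parameter count to a small fraction of the full model.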