FireRed-Image-Edit-1.0 Technical Report

Super Intelligence Team, Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, Xu Tang, Yao Hu, Yibo Chen, Yuhao Huang, Yuxuan Duan, Zhiyi Chen, Ziyuan Guo

TL;DR

FireRed-Image-Edit is a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. The report also establishes REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks.

Abstract

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. We release code, models, and the benchmark suite to support future research.
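The abstract names a Multi-Condition Aware Bucket Sampler for variable-resolution batching but does not describe its internals here. The following is only a minimal sketch of the general bucketing idea, under two assumptions that are not taken from the report: that "multi-condition" refers to the number of conditioning images per sample, and that buckets are chosen by nearest aspect ratio. The bucket table, field names, and function names are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical (width, height) buckets; the report's actual table is not given here.
BUCKETS = [(512, 1024), (768, 1024), (1024, 1024), (1024, 768), (1024, 512)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest to the sample's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


def bucket_batches(samples, batch_size, seed=0):
    """Group samples so each batch shares one bucket and one condition count.

    `samples` is a list of dicts with keys "id", "width", "height", and
    "num_conditions" (assumed: number of conditioning images, 0 for T2I).
    Returns a shuffled list of (key, [ids]) pairs; the last batch of a
    group may be smaller than `batch_size`.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s in samples:
        key = (nearest_bucket(s["width"], s["height"]), s["num_conditions"])
        groups[key].append(s["id"])

    batches = []
    for key, ids in groups.items():
        rng.shuffle(ids)  # randomize order within a group
        for i in range(0, len(ids), batch_size):
            batches.append((key, ids[i:i + batch_size]))
    rng.shuffle(batches)  # interleave groups across the epoch
    return batches
```

Because every batch holds samples of one target resolution and one condition count, tensors within a batch can be stacked without padding; whether the report's sampler works this way is not stated in the abstract.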

Paper Structure (65 sections, 8 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Comparison of generative image models across four human-evaluation dimensions (alignment, consistency, realism, aesthetics) and three editing benchmarks (ImgEdit, GEdit, REDEdit-Bench).
  • Figure 2: Showcase of FireRed-Image-Edit in general image editing.
  • Figure 3: Overview of Data Distribution. Our training dataset achieves an approximate 1:1 ratio between Text-to-Image (T2I) and Image-to-Image (I2I) tasks after data cleaning. The T2I component is divided into three main categories: Nature (general-purpose generation), People (human-centric generation), and Design (artistic styles, text rendering, and complex layouts). The I2I component is also divided into three categories: Semantic Editing (content-based modifications), Stylistic Editing (aesthetic adjustments), and Structural Editing (spatial arrangement and composition). Our data collection strategy ensures a balance of diversity and quality throughout the training process, providing comprehensive coverage and precise annotations to foster robust model training.
  • Figure 4: Overview of Data Filtering. Our multi-stage data filtering pipeline includes deduplication, photometric and statistical filtering, artifact removal, perceptual quality assessment, and AIGC detection. These steps ensure the dataset is free from redundancy, visual noise, artifacts, and synthetic content, maintaining high-quality samples for image editing model training.
  • Figure 5: Overview of Data Engine. The data production engine generates paired image editing samples through three forward construction strategies. (1) Instructional Control synthesizes expert models using instruction templates and edit-target lexicons grounded by VLM discovery and auxiliary metadata. (2) Structured Control leverages structural priors such as masks and pose keypoints extracted from perception modules to guide expert models with precise control signals. (3) Model-free Template-based Synthesis includes approaches such as predefined 3D templates, layout templates, and algorithmic filters to enable controllable and deterministic generation. The pipeline is designed to be iterable, supporting complex multi-step edits.
  • ...and 12 more figures
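
The Figure 4 caption lists the filtering stages (deduplication, photometric and statistical filtering, and so on) without giving their implementation. As a rough illustration of how such stages can be chained, the sketch below uses exact-hash deduplication and a mean-intensity check; both are stand-ins chosen for simplicity, not the report's methods, and all names and thresholds are hypothetical.

```python
import hashlib


def dedup(records):
    """Yield records whose raw bytes have not been seen before (exact duplicates only)."""
    seen = set()
    for r in records:
        digest = hashlib.sha256(r["bytes"]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield r


def photometric_ok(record, lo=10.0, hi=245.0):
    """Reject near-black or near-white images by mean 8-bit intensity.

    Assumes record["bytes"] holds raw grayscale pixel values; the
    thresholds are illustrative.
    """
    data = record["bytes"]
    return bool(data) and lo <= sum(data) / len(data) <= hi


def run_pipeline(records, predicates):
    """Deduplicate first, then keep records that pass every predicate."""
    for r in dedup(records):
        if all(p(r) for p in predicates):
            yield r
```

Further stages named in the caption (artifact removal, perceptual quality assessment, AIGC detection) would slot in as additional predicates, typically backed by learned models rather than simple statistics.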