NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev

TL;DR

The paper tackles the data bottleneck in training instruction-guided image editors by introducing an automated NoHumansRequired pipeline that mines high-quality triplets without human input. It builds a modular system combining prompt engineering, a T2I generator, an instruction-guided editor, and a two-stage validation stack, augmented with semantic inversion and bootstrap composition to expand data. A task-specific Gemini validator is fine-tuned to reliably judge instruction adherence and aesthetics, enabling scalable, high-fidelity data generation and evaluation. The authors release NHR-Edit (720k triplets) and Bagel-NHR-Edit (LoRA-tuned Bagel) and demonstrate state-of-the-art performance on public benchmarks, illustrating substantial potential for self-improving vision-language editing systems.

Abstract

Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets (original image, instruction, edited image), yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approx. 2.6x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit, an open dataset of 720k high-quality triplets, curated at industrial scale via millions of guided generations and validator passes, and we analyze the pipeline's stage-wise survival rates, providing a framework for estimating computational effort across different model stacks. In the largest cross-dataset evaluation, NHR-Edit surpasses all public alternatives. We also release Bagel-NHR-Edit, a fine-tuned Bagel model with state-of-the-art metrics.
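
The abstract mentions stage-wise survival rates as a basis for estimating computational effort. The snippet below is a minimal sketch of that kind of arithmetic, not the authors' framework. The function name `required_generations`, its parameters, and all numbers in the usage note are hypothetical. The sketch assumes each stage keeps a fixed fraction of its input, and that any expansion factor (such as the approx. 2.6x from inversion and composition) multiplies the count that survives validation.

```python
from math import ceil, prod
from typing import Sequence


def required_generations(
    target_triplets: int,
    survival_rates: Sequence[float],
    expansion_factor: float = 1.0,
) -> int:
    """Estimate how many raw candidates must be generated.

    target_triplets:  desired final dataset size.
    survival_rates:   fraction of candidates kept by each pipeline stage
                      (hypothetical values; each must lie in (0, 1]).
    expansion_factor: assumed multiplier applied after validation,
                      e.g. from inversion and composition.
    """
    if not survival_rates:
        raise ValueError("at least one survival rate is required")
    if any(not 0.0 < r <= 1.0 for r in survival_rates):
        raise ValueError("survival rates must lie in (0, 1]")
    if expansion_factor <= 0:
        raise ValueError("expansion_factor must be positive")

    # Triplets that must survive validation before expansion.
    mined_needed = target_triplets / expansion_factor
    # Overall survival is the product of the per-stage rates.
    return ceil(mined_needed / prod(survival_rates))
```

For instance, with two hypothetical stages that each keep half of their input, `required_generations(100, [0.5, 0.5])` asks for 400 raw candidates. Swapping in measured rates for a different model stack changes only the inputs.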

Paper Structure

This paper contains 34 sections, 2 equations, 27 figures, 17 tables.

Figures (27)

  • Figure 1: High-quality samples from our NHR-Edit dataset.
  • Figure 2: Example of a generated T2I prompt and its corresponding edit instructions.
  • Figure 3: Solid arrows represent forward instructions, and dashed arrows represent their semantic inversions. Instructions for compositional triplets are aggregated from both forward instructions and inversions.
  • Figure B.1: Score distributions for the training and validation splits of the assessor fine-tuning dataset.
  • Figure B.2: Composition of the Gemini Assessor Fine-Tuning Corpus by Source Model. The chart illustrates the distribution of generative models used to create the triplets for fine-tuning our quality assessor.
  • ...and 22 more figures
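
The Figure 3 caption describes forward instructions, their semantic inversions, and compositional triplets aggregated from both. One way to read that relationship is sketched below; it is an illustration under stated assumptions, not the paper's implementation. The `Triplet` class, the `invert` and `compose` helpers, and the idea of joining instruction strings with "; " are all hypothetical. In particular, the sketch assumes a compositional triplet goes from one edited image to another by undoing the first edit and then applying the second.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List


@dataclass(frozen=True)
class Triplet:
    source: str       # identifier of the input image
    instruction: str  # natural-language edit instruction
    target: str       # identifier of the edited image


def invert(t: Triplet, inverse_instruction: str) -> Triplet:
    """Swap source and target, pairing them with the inverse instruction."""
    return Triplet(source=t.target, instruction=inverse_instruction, target=t.source)


def compose(forward: List[Triplet], inverses: Dict[str, str]) -> List[Triplet]:
    """Build edited-to-edited triplets from edits that share one original.

    inverses maps each forward instruction to its inverse text
    (assumed to be supplied externally, e.g. by a language model).
    """
    composed = []
    for a, b in permutations(forward, 2):
        if a.source != b.source:
            continue  # only combine edits of the same original image
        # Assumption: undo edit `a`, then apply edit `b`.
        instruction = f"{inverses[a.instruction]}; {b.instruction}"
        composed.append(Triplet(a.target, instruction, b.target))
    return composed


forward = [
    Triplet("A", "add a hat", "B"),
    Triplet("A", "make it snowy", "C"),
]
inverses = {"add a hat": "remove the hat", "make it snowy": "remove the snow"}

inverted = [invert(t, inverses[t.instruction]) for t in forward]
composed = compose(forward, inverses)
```

In this toy case two forward triplets yield two inversions and two compositions. The growth factor of any real pipeline depends on how many edits share an original and how many derived triplets pass validation.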