IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment

Bowen Qu, Shangkun Sun, Xiaoyu Liang, Wei Gao

TL;DR

IE-Critic-R1 addresses the challenge of evaluating text-driven image editing with two contributions: IE-Bench, a large human-annotated benchmark covering four quality dimensions, and a two-stage evaluation framework (IE-Critic-CoT and IE-Critic-R1) that combines chain-of-thought (CoT) reasoning with Reinforcement Learning from Verifiable Rewards (RLVR). Trained on multi-dimensional human scores and GPT-4o-based CoT data, the approach achieves state-of-the-art alignment with human perception on IE-Bench and AGIQA-3k, illustrating the value of reasoning before scoring and of carefully designed rewards. The work also reports that an "R1 Moment" emerges during RLVR training, in which reasoning trajectories become longer and more detailed, enabling more faithful and interpretable assessments of edited images. Data and code are publicly available to accelerate research and practical adoption in text-driven image editing evaluation.

Abstract

Recent advances in text-driven image editing have been significant, yet accurately evaluating the edited images remains a considerable challenge. Unlike text-driven image generation, text-driven image editing is conditioned simultaneously on both text and a source image. The edited images often retain an intrinsic connection to the original image, and this connection changes dynamically with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or are not well aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, the corresponding edited results from different editing methods, and nearly 4,000 samples with corresponding Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate that IE-Critic-R1 aligns better with subjective human judgments on the text-driven image editing task than previous metrics. Related data and code are publicly available.
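As a minimal illustration of what a Mean Opinion Score is, the sketch below averages per-subject ratings for each sample. The sample identifiers, the rating values, and the helper name `mean_opinion_score` are hypothetical, and the plain arithmetic mean is an assumption: the abstract does not say whether IE-Bench normalizes ratings or screens outliers before averaging.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the arithmetic mean of one sample's subject ratings.

    Assumption: a plain mean, with no per-subject normalization or
    outlier rejection (the abstract does not describe either step).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


# Hypothetical ratings for two edited images (made-up values, three
# subjects shown for brevity rather than the 15 used in IE-Bench).
ratings_by_sample = {
    "sample_a": [4.0, 3.0, 5.0],
    "sample_b": [2.0, 2.0, 3.0],
}

mos_by_sample = {
    sample_id: mean_opinion_score(ratings)
    for sample_id, ratings in ratings_by_sample.items()
}

for sample_id, mos in mos_by_sample.items():
    print(f"{sample_id}: MOS = {mos:.2f}")
```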

Paper Structure

This paper contains 33 sections, 4 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Overview of IE-Bench and IE-Critic-R1. Compared with the Qwen baseline, IE-Critic-R1, trained with the RLVR design, shows stronger thinking ability.
  • Figure 2: Score distributions of various editing methods. The central line represents the median, and the whiskers extend to the minimum and maximum values.
  • Figure 3: Statistics of IE-Bench prompts. (a) Word cloud of IE-Bench DB prompts. (b) Proportion of different types
  • Figure 4: Overlay Comparison of Reward Functions.
  • Figure 5: Response length curves during RL training. (a) CoT + Direct SFT maintains longer and more stable responses compared to CoT-only SFT. (b) The $\ell_1$ reward function achieves the best performance in encouraging detailed reasoning, while Laplacian, $\ell_2$, and Gaussian rewards show declining trends.
  • ...and 5 more figures
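
The captions of Figures 4 and 5 name four reward shapes: $\ell_1$, $\ell_2$, Gaussian, and Laplacian. The sketch below shows one common way to write each as a function of the gap between a predicted score and the human MOS. It is an illustration only: this page does not give the paper's exact formulas, so the function names, the `scale`, `sigma`, and `b` parameters, and the assumed 1-to-5 score range (maximum error of 4) are all hypothetical.

```python
import math


def l1_reward(pred, target, scale=4.0):
    """Linear decay: 1 for an exact match, 0 once |error| reaches `scale`."""
    return max(0.0, 1.0 - abs(pred - target) / scale)


def l2_reward(pred, target, scale=4.0):
    """Quadratic decay: flat near the target, steeper for large errors."""
    return max(0.0, 1.0 - ((pred - target) / scale) ** 2)


def gaussian_reward(pred, target, sigma=1.0):
    """Bell-shaped reward with (assumed) width `sigma`."""
    return math.exp(-((pred - target) ** 2) / (2.0 * sigma ** 2))


def laplacian_reward(pred, target, b=1.0):
    """Sharply peaked reward with (assumed) decay rate `b`."""
    return math.exp(-abs(pred - target) / b)


if __name__ == "__main__":
    target_mos = 3.0  # hypothetical human score
    for pred in (3.0, 3.5, 4.0, 5.0):
        print(
            f"pred={pred:.1f}",
            f"l1={l1_reward(pred, target_mos):.3f}",
            f"l2={l2_reward(pred, target_mos):.3f}",
            f"gauss={gaussian_reward(pred, target_mos):.3f}",
            f"laplace={laplacian_reward(pred, target_mos):.3f}",
        )
```

All four functions return 1 for an exact match and shrink as the error grows. They differ in how quickly the reward falls off near the target, which is the kind of difference an overlay plot of reward functions makes visible.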