ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

Yuxiang Guo, Jiang Liu, Ze Wang, Hao Chen, Ximeng Sun, Yang Zhao, Jialian Wu, Xiaodong Yu, Zicheng Liu, Emad Barsoum

TL;DR

This work tackles the challenge of evaluating text-to-image generation with interpretable, multi-dimensional feedback. It introduces ImageDoctor, a unified multimodal evaluator that outputs four quality scores and pixel-level heatmaps, using a look-think-predict reasoning paradigm trained via cold-start supervision and reinforcement fine-tuning. It further extends RLHF with DenseFlow-GRPO, which incorporates dense, pixel-level rewards to guide region-aware improvements in T2I generation. Across RichHF-18K and cross-domain benchmarks, ImageDoctor aligns closely with human judgments when used as a metric, verifier, and reward function, and DenseFlow-GRPO yields the most robust gains in local detail fidelity and alignment with human preferences.

Abstract

The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality -- achieving an improvement of 10% over scalar-based reward models.
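To make the idea of heatmaps serving as a dense reward more concrete, the following is a minimal sketch, not the paper's DenseFlow-GRPO formulation. It assumes scores and heatmap values lie in [0, 1]; the `Diagnosis` container, its field names, the `dense_reward` function, and the `penalty` weight are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Rows of per-pixel flaw intensities, assumed to lie in [0, 1].
Heatmap = List[List[float]]


@dataclass
class Diagnosis:
    """Hypothetical container for one evaluation (names are illustrative)."""
    plausibility: float
    alignment: float
    aesthetics: float
    overall: float
    artifact_map: Heatmap
    misalignment_map: Heatmap


def dense_reward(d: Diagnosis, penalty: float = 1.0) -> Heatmap:
    """Assumed rule: per-pixel reward = overall score minus a penalty
    proportional to the stronger of the two flaw signals at that pixel."""
    return [
        [d.overall - penalty * max(a, m) for a, m in zip(row_a, row_m)]
        for row_a, row_m in zip(d.artifact_map, d.misalignment_map)
    ]


# A 2x2 toy image with one flawed pixel in the top-right corner.
diag = Diagnosis(
    plausibility=0.8,
    alignment=0.9,
    aesthetics=0.7,
    overall=0.75,
    artifact_map=[[0.0, 0.5], [0.0, 0.0]],
    misalignment_map=[[0.0, 0.25], [0.0, 0.0]],
)
rewards = dense_reward(diag)
```

In this toy case the clean pixels keep the full overall score, while the flawed pixel receives a lower reward. A single scalar reward would assign the same value to every region, which is the limitation the abstract describes.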

Paper Structure

This paper contains 33 sections, 8 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison between ImageDoctor and scalar-based reward functions. Left: ImageDoctor follows a "look-think-predict" paradigm, providing rich feedback with four-dimensional scores and heatmaps that highlight misalignment and artifact locations. Right: Leveraging this fine-grained feedback, DenseFlow-GRPO (Sec. \ref{sec:denseflow-grpo}) generates images with more faithful and realistic local details, outperforming Flow-GRPO, which relies on the scalar-based reward PickScore.
  • Figure 2: ImageDoctor architecture. Given a prompt-image pair, the MLLM follows a "look-think-predict" paradigm for T2I evaluation by localizing potential flaw regions, analyzing them, and generating holistic scores and special task tokens. The task token, together with a learned heatmap token and the image features, is fed into the heatmap decoder to produce the misalignment and artifact heatmaps.
  • Figure 3: Visualization of misalignment and artifact heatmaps.
  • Figure 4: Qualitative comparison on selected images by different verifiers in test-time scaling. ImageDoctor picks the images that faithfully reflect the text prompt (top) and preserve realistic object scale (bottom).
  • Figure 5: Flow-GRPO vs. DenseFlow-GRPO. The artifacts are boxed.
  • ...and 4 more figures