FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li

TL;DR

This work tackles the lack of large-scale, reasoning-focused data and evaluation for text-to-image models by introducing FLUX-Reason-6M, a 6M-image dataset with 20M bilingual captions and generation chain-of-thought cues across six reasoning dimensions, and PRISM-Bench, a seven-track human-aligned benchmark. The authors design a VLM-driven data pipeline that synthesizes high-quality imagery, annotates it with dense, category-specific captions, and produces GCoT reasoning templates, followed by bilingual expansion to a 20M-caption corpus. They then establish PRISM-Bench with 700 prompts across seven tracks, scored by advanced vision-language models to yield fine-grained alignment and uniform aesthetics scores. The benchmark is used to evaluate 19 models, and results are also reported on PRISM-Bench-ZH. Across extensive experiments, closed-source models outperform open-source ones in most tracks, particularly on long-text and text-rendering tasks, highlighting the remaining challenges and the value of reasoning-focused datasets. By releasing the dataset, benchmark, and code, the work aims to democratize access and accelerate progress toward truly reasoning-capable open T2I systems.

Abstract

The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design explicit Generation Chain-of-Thought (GCoT) annotations to provide detailed breakdowns of image generation steps. The full data curation took 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced, human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/ .

Paper Structure

This paper contains 39 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Evaluation of state-of-the-art text-to-image models with the proposed PRISM-Bench.
  • Figure 2: Showcase of FLUX-Reason-6M in six different characteristics and generation chain of thought. Keywords related to characteristics in the captions are highlighted in color.
  • Figure 3: An overview of FLUX-Reason-6M data curation pipeline. The entire process was completed using 128 A100 GPUs over a period of 4 months.
  • Figure 4: Left: Three subsets of raw prompt sources. Middle: Image category ratio. Right: Prompt Suite Statistics.
  • Figure 5: An overview of the prompt design and evaluation protocol of PRISM-Bench.
  • ...and 2 more figures
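
The abstract reports 15,000 A100 GPU days of curation, while the Figure 3 caption reports 128 A100 GPUs over 4 months. As a rough consistency check, the short sketch below multiplies the two caption figures out. It assumes a 30-day month and full utilization of all 128 GPUs, neither of which the source states; the function and variable names are illustrative only.

```python
# Rough cross-check of the two compute figures quoted in the source.
# Assumption (not stated in the source): one month is about 30 days,
# and all GPUs are busy for the entire period.
DAYS_PER_MONTH = 30


def gpu_days(num_gpus: int, months: float, days_per_month: int = DAYS_PER_MONTH) -> float:
    """Return total GPU days for `num_gpus` running continuously for `months`."""
    return num_gpus * months * days_per_month


# Figures taken from the Figure 3 caption: 128 A100 GPUs, 4 months.
estimate = gpu_days(num_gpus=128, months=4)

# Figure taken from the abstract: 15,000 A100 GPU days.
reported = 15_000

# Relative difference between the caption-derived estimate and the abstract.
relative_gap = abs(estimate - reported) / reported

print(f"estimated GPU days: {estimate}")
print(f"reported GPU days:  {reported}")
print(f"relative gap:       {relative_gap:.1%}")
```

Under these assumptions the caption-derived estimate lands within a few percent of the abstract's rounded figure, so the two statements appear mutually consistent.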