TIQA: Human-Aligned Text Quality Assessment in Generated Images

Kirill Koltsov, Aleksandr Gushchin, Dmitriy Vatolin, Anastasia Antsiferova

TL;DR

TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.

Abstract

Text rendering remains a persistent failure mode of modern text-to-image (T2I) models, yet existing evaluations rely on OCR correctness or VLM-based judging procedures that are poorly aligned with perceptual text artifacts. We introduce Text-in-Image Quality Assessment (TIQA), the task of predicting a scalar quality score that matches human judgments of rendered-text fidelity within cropped text regions. We release two MOS-labeled datasets: TIQA-Crops (10k text crops) and TIQA-Images (1,500 images), spanning 20+ T2I models, including proprietary ones. We also propose ANTIQA, a lightweight method with text-specific biases, and show that it improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least $\sim0.05$ on TIQA-Crops and $\sim0.08$ on TIQA-Images, as measured by PLCC. Finally, we show that TIQA models are valuable in downstream tasks: for example, selecting the best of 5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.
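
The two evaluation ideas in the abstract, PLCC against human scores and best-of-K selection, can be illustrated with a minimal sketch. This is not the authors' code: `plcc` is the standard Pearson formula, while `best_of_k`, `score_fn`, and the toy values in the usage lines are hypothetical names and made-up numbers. How crop-level scores are aggregated into an image-level score is not stated here, so the sketch simply assumes one scalar score per candidate.

```python
import math


def plcc(predicted, mos):
    """Pearson linear correlation coefficient between predictions and MOS."""
    if len(predicted) != len(mos) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_m = sum(mos) / len(mos)
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, mos))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in mos)
    if var_p == 0 or var_m == 0:
        raise ValueError("PLCC is undefined for constant input")
    return cov / math.sqrt(var_p * var_m)


def best_of_k(candidates, score_fn):
    """Return the candidate with the highest predicted quality score.

    `score_fn` is a hypothetical stand-in for any TIQA-style scorer that
    maps one candidate to one scalar (higher means better text quality).
    """
    return max(candidates, key=score_fn)


# Toy usage with made-up scores (not results from the paper).
toy_scores = {"gen_a": 0.41, "gen_b": 0.78, "gen_c": 0.55}
chosen = best_of_k(list(toy_scores), toy_scores.get)
```

In a generation pipeline, `candidates` would be the K images sampled for one prompt, and the selected image is the one passed downstream.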

Paper Structure (44 sections, 10 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: Examples of text rendering artifacts in AI-generated images across multiple generators. Even when text remains partially readable, humans penalize visual artifacts. TIQA is the task of assessing these perceptual failures rather than semantic correctness.
  • Figure 2: Overview of Text-in-Image Quality Assessment (TIQA). Left: AI-generated images contain multiple text regions that are detected and cropped. Middle: a TIQA model predicts a scalar text-quality score for each crop, trained on mean opinion scores (MOS). Right: representative model families used as baselines (VLM judges, OCR confidence, generic IQA) and the proposed specialized TIQA model. Bottom: example applications of TIQA for measuring generator quality, filtering candidates in production pipelines (best-of-K), and optimizing generation via reranking or closed-loop control.
  • Figure 3: ANTIQA architecture. Each text crop is converted to grayscale, concatenated with a Sobel edge map, and then processed by a lightweight multi-scale CNN with residual stages and downsampling. Features from multiple resolutions are pooled to fixed grids using adaptive average and max pooling, fused via an MLP head, and regressed to a single MOS prediction.
  • Figure 4: Box-plot distributions of OQ-MOS and TQ-MOS for separate generators. The models are sorted by mean TQ-MOS.
  • Figure 5: Visualization of crop detections with artifacts from different detectors. The red areas visualize the text detected by the detector.
  • ...and 8 more figures
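
The input preparation described in the Figure 3 caption (grayscale crop concatenated with a Sobel edge map) can be sketched in plain Python as follows. This is an illustrative sketch, not the ANTIQA implementation: the BT.601 luma weights, the edge-replication border handling, the use of gradient magnitude as the edge map, and the function names are all assumptions, since the caption does not specify them. The CNN, pooling, and MLP stages are omitted.

```python
import math


def to_grayscale(rgb):
    """rgb: H x W nested list of (r, g, b) in [0, 255] -> H x W floats in [0, 1]."""
    # Assumption: BT.601 luma weights; the caption does not give the conversion.
    return [[(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for (r, g, b) in row]
            for row in rgb]


def sobel_magnitude(gray):
    """Gradient magnitude from the 3x3 Sobel kernels."""
    h, w = len(gray), len(gray[0])

    def px(y, x):
        # Assumption: replicate edge pixels outside the image.
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Horizontal gradient: right column minus left column.
            gx = (px(y - 1, x + 1) + 2 * px(y, x + 1) + px(y + 1, x + 1)
                  - px(y - 1, x - 1) - 2 * px(y, x - 1) - px(y + 1, x - 1))
            # Vertical gradient: bottom row minus top row.
            gy = (px(y + 1, x - 1) + 2 * px(y + 1, x) + px(y + 1, x + 1)
                  - px(y - 1, x - 1) - 2 * px(y - 1, x) - px(y - 1, x + 1))
            out[y][x] = math.hypot(gx, gy)
    return out


def make_input(rgb):
    """Stack grayscale and edge map as a 2-channel [C][H][W] nested list."""
    gray = to_grayscale(rgb)
    return [gray, sobel_magnitude(gray)]
```

The explicit edge channel gives the network direct access to stroke boundaries, which is where the rendering artifacts in text crops tend to appear.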