TIQA: Human-Aligned Text Quality Assessment in Generated Images

Kirill Koltsov; Aleksandr Gushchin; Dmitriy Vatolin; Anastasia Antsiferova

TL;DR

TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.
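The best-of-5 selection mentioned above can be sketched as a simple argmax over predicted quality scores. This is a minimal illustration, not the authors' pipeline: `best_of_k`, `score_fn`, and the example scores are hypothetical names and values, and how per-crop predictions are combined into one score per image is not specified here, so the sketch assumes a single score per candidate.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_k(candidates: Sequence[T], score_fn: Callable[[T], float]) -> T:
    """Return the candidate whose predicted text-quality score is highest.

    `score_fn` stands in for any TIQA-style scorer (hypothetical interface).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


# Made-up scores for five generations of the same prompt (illustrative only).
fake_scores = {"gen_0": 2.1, "gen_1": 3.8, "gen_2": 4.4, "gen_3": 1.7, "gen_4": 3.9}

best = best_of_k(list(fake_scores), fake_scores.__getitem__)
print(best)
```

Because the scorer is passed in as a function, the same helper covers filtering and reranking with any of the baseline families (OCR confidence, VLM judges, generic IQA) for comparison.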

Abstract

Text rendering remains a persistent failure mode of modern text-to-image (T2I) models, yet existing evaluations rely on OCR correctness or VLM-based judging procedures that are poorly aligned with perceptual text artifacts. We introduce Text-in-Image Quality Assessment (TIQA), a task that predicts a scalar quality score that matches human judgments of rendered-text fidelity within cropped text regions. We release two MOS-labeled datasets: TIQA-Crops (10k text crops) and TIQA-Images (1,500 images), spanning 20+ T2I models, including proprietary ones. We also propose ANTIQA, a lightweight method with text-specific biases, and show that it improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least $\sim0.05$ on TIQA-Crops and $\sim0.08$ on TIQA-Images, as measured by PLCC. Finally, we show that TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.
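PLCC (Pearson linear correlation coefficient) is the agreement measure quoted above. As a hedged sketch, the plain coefficient between predicted scores and MOS labels can be computed as follows. The function name `plcc` and the sample values are illustrative, and whether the paper fits a nonlinear mapping to the predictions before computing PLCC is not stated here, so none is applied.

```python
import math
from typing import Sequence


def plcc(predicted: Sequence[float], mos: Sequence[float]) -> float:
    """Pearson linear correlation between predictions and mean opinion scores."""
    if len(predicted) != len(mos) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(mos) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, mos))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in mos)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_m)


# Made-up predictions and MOS labels for four crops (illustrative only).
print(round(plcc([1.2, 2.9, 3.1, 4.8], [1.0, 3.0, 3.5, 4.5]), 3))
```

A gap of about 0.05 in this coefficient is therefore a difference in linear agreement with human ratings, not in the scores themselves.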

Paper Structure (44 sections, 10 equations, 13 figures, 7 tables)


Introduction
Related Work
Text-in-Image Quality Assessment (TIQA) for AI images
Task definition
Downstream Tasks
Datasets for TIQA
TIQA-Crops dataset: training and in-domain evaluation
TIQA-Images: Text-Heavy Images from Modern T2I Models
Method: ANTIQA
Architecture
Training
Experiments
Experimental Setup
Results on TIQA-Crops
Results on TIQA-Images
...and 29 more sections

Figures (13)

Figure 1: Examples of text rendering artifacts in AI-generated images across multiple generators. Even when text remains partially readable, humans penalize visual artifacts. TIQA is the task of assessing these perceptual failures rather than semantic correctness.
Figure 2: Overview of Text-in-Image Quality Assessment (TIQA). Left: AI-generated images contain multiple text regions that are detected and cropped. Middle: a TIQA model predicts a scalar text-quality score for each crop, trained on mean opinion scores (MOS). Right: representative model families used as baselines (VLM judges, OCR confidence, generic IQA) and the proposed specialized TIQA model. Bottom: example applications of TIQA for measuring generator quality, filtering candidates in production pipelines (best-of-K), and optimizing generation via reranking or closed-loop control.
Figure 3: ANTIQA architecture. Each text crop is converted to grayscale, concatenated with a Sobel edge map, and then processed by a lightweight multi-scale CNN with residual stages and downsampling. Features from multiple resolutions are pooled to fixed grids using adaptive average and max pooling, fused via an MLP head, and regressed to a single MOS prediction.
Figure 4: Box-plot distributions of OQ-MOS and TQ-MOS for separate generators. The models are sorted by mean TQ-MOS.
Figure 5: Visualization of crop detections with artifacts from different detectors. The red areas visualize the text detected by the detector.
...and 8 more figures
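
The Figure 3 caption describes the ANTIQA input as a grayscale crop concatenated with a Sobel edge map. A minimal pure-Python sketch of that preprocessing step follows. The function names, the BT.601 luma weights, the replicate-border padding, and the use of gradient magnitude as the edge map are all assumptions made for illustration; the caption does not specify them.

```python
import math
from typing import List, Tuple

Gray = List[List[float]]
RGB = List[List[Tuple[float, float, float]]]

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def to_grayscale(rgb: RGB) -> Gray:
    """Convert an H x W grid of (r, g, b) pixels to luma (assumed BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def sobel_magnitude(gray: Gray) -> Gray:
    """Sobel gradient magnitude with replicate-border padding (an assumption)."""
    h, w = len(gray), len(gray[0])

    def px(y: int, x: int) -> float:
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = sum(SOBEL_X[j][i] * px(y + j - 1, x + i - 1)
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * px(y + j - 1, x + i - 1)
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out


def two_channel_input(rgb: RGB) -> List[Gray]:
    """Stack grayscale and edge map as two channels, each H x W."""
    gray = to_grayscale(rgb)
    return [gray, sobel_magnitude(gray)]


# Tiny 3 x 4 crop with a vertical step edge between columns 1 and 2.
crop = [[(v, v, v) for v in (0.0, 0.0, 1.0, 1.0)] for _ in range(3)]
channels = two_channel_input(crop)
print(len(channels), len(channels[0]), len(channels[0][0]))
```

In a real model these two channels would feed the multi-scale CNN the caption describes; the sketch stops at the input tensor.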
