Evaluating Numerical Reasoning in Text-to-Image Models

Ivana Kajić; Olivia Wiles; Isabela Albuquerque; Matthias Bauer; Su Wang; Jordi Pont-Tuset; Aida Nematzadeh

TL;DR

The paper introduces GeckoNum, a controlled benchmark for evaluating numerical reasoning in text-to-image models across three tasks: exact number generation, approximate number generation and zero, and conceptual quantitative reasoning. Using 12 prompt templates and human annotations covering 1,386 prompts and 52,721 images, it shows that state-of-the-art models have only rudimentary numerical skills: accuracy is limited to small exact counts and is highly sensitive to prompt structure and number representation. The study also analyzes automatic evaluation (auto-eval) metrics and investigates counting in vision-language models using PaliGemma, showing that fine-tuning and synthetic data modestly improve counting in some settings, while generalization remains limited. GeckoNum provides a basis for more reliable evaluation of numerical reasoning in text-to-image systems and points to gaps that future work on metrics, model design, and training data could address.

Abstract

Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, is highly dependent on the context in which the number term appears, and deteriorates quickly with each successive number. We also demonstrate that models have a poor understanding of linguistic quantifiers (such as "a few" or "as many as") and the concept of zero, and that they struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images, and human annotations into GeckoNum, a novel benchmark for the evaluation of numerical reasoning.
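
The claim that exact-count accuracy falls off as the requested number grows can be illustrated with a minimal sketch. It assumes each annotated image reduces to a pair of (requested count, human-annotated count); the function names, record format, and toy data below are hypothetical and are not taken from the GeckoNum release. The percentile bootstrap is one common way to obtain 95% confidence intervals and is not necessarily the exact procedure the authors used.

```python
import random
from collections import defaultdict


def per_count_accuracy(records):
    """Exact-match accuracy per requested count.

    records: iterable of (requested_count, annotated_count) pairs
    (a hypothetical format chosen for this sketch).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for requested, annotated in records:
        totals[requested] += 1
        hits[requested] += int(requested == annotated)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data, invented for illustration only.
records = [(1, 1), (1, 1), (2, 2), (2, 3), (3, 3), (3, 5), (4, 6), (4, 3)]
accuracy = per_count_accuracy(records)

# Overall accuracy interval across all toy records.
outcomes = [int(requested == annotated) for requested, annotated in records]
ci_low, ci_high = bootstrap_ci(outcomes)
```

With real annotations, the per-count dictionary would be computed separately for each model and prompt template, so that sensitivity to prompt structure shows up as a difference between templates.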

Paper Structure (46 sections, 21 figures, 14 tables)

This paper contains 46 sections, 21 figures, 14 tables.

Introduction
Related Work
Tasks to Examine Numerical Reasoning
Task 1: Exact Number Generation
Task 2: Approximate Number Generation and Zero
Task 3: Conceptual Quantitative Reasoning
Human Annotations of Images
Evaluating Text-to-Image Models
Task 1: Exact Number Generation
Task 2: Approximate Number Generation and Zero
Task 3: Conceptual Quantitative Reasoning
Measuring What Counts: Challenges in Evaluation of Numerical Reasoning
Evaluating counting in vision-language models (VLMs)
Discussion and Conclusion
Acknowledgments
...and 31 more sections

Figures (21)

Figure 1: Examples of images generated by selected models: DALL·E 3, Imagen-C and Muse-B. Correctly generated images are marked with a check mark "✓", and incorrect with a cross mark "✗".
Figure 2: Three types of annotation templates used to collect data for the evaluation of text-to-image models on three numerical reasoning tasks.
Figure 3: Accuracy of models on each prompt type for a subset of prompts that contain small numbers (i.e., 1–4) and a smaller subset of nouns.
Figure 4: Top: The confusion matrices for A) DALL·E 3 and B) Midjourney v6 on numeric-simple prompts. Bottom: The effect of C) number representation and D) word frequencies in Task 1. 95% bootstrap confidence intervals are shown.
Figure 5: Accuracy for approx-1-entity and approx-2-entity prompts.
...and 16 more figures
