Evaluating Numerical Reasoning in Text-to-Image Models
Ivana Kajić, Olivia Wiles, Isabela Albuquerque, Matthias Bauer, Su Wang, Jordi Pont-Tuset, Aida Nematzadeh
TL;DR
The paper introduces GeckoNum, a controlled benchmark for evaluating numerical reasoning in text-to-image models across three tasks: exact number generation, approximate number generation and zero, and conceptual quantitative reasoning. Using 12 prompt templates and human annotations over 1,386 prompts and 52,721 images, it shows that state-of-the-art models exhibit only rudimentary numerical skills: accuracy is limited to small exact counts and is highly sensitive to prompt structure and representation. The study also analyzes automatic evaluation metrics and investigates counting in vision-language models using PaliGemma, showing that fine-tuning and synthetic data can modestly improve counting in some settings but that generalization remains limited. GeckoNum provides a foundation for more reliable evaluation of numerical reasoning in text-to-image systems, exposing significant gaps and guiding future research on metrics, model design, and training data strategies.
Abstract
Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, is highly dependent on the context the number term appears in, and deteriorates quickly with each successive number. We also demonstrate that models have a poor understanding of linguistic quantifiers (such as "a few" or "as many as") and of the concept of zero, and that they struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images, and human annotations into GeckoNum, a novel benchmark for the evaluation of numerical reasoning.
