MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation

Ruiyao Liu, Hui Shen, Ping Zhang, Yunta Hsieh, Yifan Zhang, Jing Xu, Sicheng Chen, Junchen Li, Jiawei Lu, Jianing Ma, Jiaqi Mo, Qi Han, Zhen Zhang, Zhongwei Wan, Jing Xiong, Xin Wang, Ziyuan Liu, Hangrui Cao, Ngai Wong

Abstract

Modern generative models have demonstrated the ability to solve challenging mathematical problems. In many real-world settings, however, mathematical solutions must be expressed visually, through diagrams, plots, geometric constructions, and structured symbolic layouts, where correctness depends on precise visual composition. This raises a natural question: can generative models still solve such problems when the answer must be rendered visually rather than written in text? To study this problem, we introduce MathGen, a rigorous benchmark of 900 problems spanning seven core domains, each paired with an executable verifier under a Script-as-a-Judge protocol for deterministic and objective evaluation. Experiments on representative open-source and proprietary text-to-image models show that mathematical fidelity remains a major bottleneck: even the best closed-source model reaches only 42.0% overall accuracy, while open-source models achieve only roughly 1-11%, often near 0% on structured tasks. Overall, current T2I models remain far from competent at even elementary mathematical visual generation.

Paper Structure

This paper contains 28 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Task taxonomy of MathGen. MathGen covers seven fundamental mathematical domains. Example prompts and reference illustrations are shown to provide an intuitive overview of the mathematical concepts evaluated in each domain. They highlight the diverse forms of numerical, geometric, and structural constraints that generative models are required to express, and illustrate the types of visual outcomes expected under correct mathematical interpretation.
  • Figure 2: Performance comparison. The chart shows the accuracy of representative open-source and closed-source text-to-image models on each MathGen domain.
  • Figure 3: Overview of the MathGen benchmark and evaluation pipeline. MathGen evaluates text-to-image models on seven mathematical domains using structured prompts and automatic verification. Generated images are validated against domain-specific structural, geometric, and logical constraints. Each criterion $C_i$ is checked independently, and the final correctness is determined through logical aggregation.
  • Figure 4: Qualitative comparison of T2I models on representative MathGen tasks.
  • Figure 5: Typical success and failure examples from the Clean-Scene and Open-Scene settings.
  • ...and 2 more figures
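The Figure 3 caption describes the core of the Script-as-a-Judge protocol: each criterion $C_i$ is checked independently against the generated image, and final correctness is obtained by logical aggregation. A minimal sketch of that idea, assuming a conjunctive (logical AND) aggregation and using illustrative names throughout (none of the functions or property keys below are from the authors' actual implementation):

```python
# Hypothetical sketch of Script-as-a-Judge: each criterion C_i is an
# independent executable check over properties extracted from the generated
# image; final correctness is the logical aggregation (here, AND) of all
# per-criterion verdicts. All names are illustrative assumptions.
from typing import Callable, Dict, List

Criterion = Callable[[dict], bool]  # extracted image properties -> pass/fail

def judge(properties: dict, criteria: List[Criterion]) -> Dict[str, object]:
    """Run every criterion independently, then aggregate deterministically."""
    results = [bool(c(properties)) for c in criteria]
    return {"per_criterion": results, "correct": all(results)}

# Example: verifying a plotted parabola y = x^2 with vertex at the origin.
criteria: List[Criterion] = [
    lambda p: p["curve_type"] == "parabola",  # C1: correct curve family
    lambda p: p["vertex"] == (0, 0),          # C2: correct vertex location
    lambda p: p["opens"] == "up",             # C3: correct orientation
]

extracted = {"curve_type": "parabola", "vertex": (0, 0), "opens": "up"}
verdict = judge(extracted, criteria)
```

Because every check is an executable predicate rather than a model judgment, the same input always yields the same verdict, which is what makes the evaluation deterministic and objective.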