Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, Chitta Baral

TL;DR

PolyMATH introduces a 5,000-question multi-modal benchmark to rigorously assess cognitive and mathematical reasoning in MLLMs across ten categories, with and without diagrams. The authors systematically curate the dataset, provide diagram descriptions for text-only evaluation, and evaluate a broad set of closed- and open-source models under multiple prompting strategies. Results reveal substantial gaps between current models and human performance, especially on diagram-heavy tasks, though OpenAI o1 models approach human performance on text-only variants. The work provides targeted insights into failure modes, notably spatial misunderstanding and logical errors, and offers a clear direction for future advancements in visual-math reasoning for multimodal models.

Abstract

Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning. We conducted a comprehensive and quantitative evaluation of 15 MLLMs using four diverse prompting strategies, including Chain-of-Thought and Step-Back. The best scores achieved on PolyMATH are ~41%, ~36%, and ~27%, obtained by Claude-3.5 Sonnet, GPT-4o, and Gemini-1.5 Pro respectively, highlighting the logical and visual complexity of these questions. A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning. This is further strengthened by our ablation study estimating MLLM performance when given textual descriptions in place of diagrams. As evidenced by a ~4% improvement when models are given textual descriptions rather than the actual images, we discover that models do not truly comprehend visual diagrams and the spatial information therein, and are thus prone to logical errors. Finally, we evaluate the OpenAI o1 models and find that their performance only matches the human baseline, highlighting the difficulty of the benchmark. The results on PolyMATH highlight the room for improvement in multi-modal reasoning and provide unique insights to guide the development of future MLLMs.

Paper Structure

This paper contains 31 sections, 26 figures, 19 tables.

Figures (26)

  • Figure 1: Examples of the reasoning patterns employed by MLLMs when faced with questions involving visual information. In the top row, models fail to perceive the relationship between adjacent semicircles; in the bottom row, models fail to comprehend fine details in the answer images.
  • Figure 3: Examples of with-diagram and without-diagram questions. In addition to the question image, PolyMATH includes the metadata shown above. Without-diagram questions are not present in test-img, while both kinds of questions are present in testmini.
  • Figure 4: Frequency of Logical Flaw (LF) and Spatial Misunderstanding (SM) errors across different question categories. We report per-model figures to enable a comparison of model abilities. These errors are most prevalent in the OD, PR, and SC categories of questions, owing to the number of logical leaps and the amount of visual reasoning these questions require.
  • Figure 5: Questions belonging to the figure_completion (FC) category
  • Figure 6: Questions belonging to the logical_reasoning (LR) category
  • ...and 21 more figures