Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads

Anoop Cherian, Kuan-Chuan Peng, Suhas Lohit, Joanna Matthiesen, Kevin Smith, Joshua B. Tenenbaum

TL;DR

The paper investigates whether state-of-the-art large vision-and-language models (LVLMs) exhibit human-like problem solving on mathematical reasoning tasks designed for children. It introduces SMART-840, a dataset of 840 Math Kangaroo (MK) problems from 2020–2024 spanning grades 1–12 in both text-only and image-text formats, and benchmarks several LVLMs (e.g., GPT-4o, Claude-3, Gemini-Pro) zero-shot against children's performance. LVLM accuracy improves with grade level but generally trails human performance, especially for younger grades, and correlates weakly or negatively with problem difficulty as perceived by children, suggesting fundamentally different reasoning regimes. The work highlights reliability and modality gaps in current LVLMs and provides an age-stratified benchmark to guide future multimodal reasoning research and evaluation. Overall, SMART-840 reveals gaps between machine and human mathematical reasoning that can inform future work on general-purpose LVLM capabilities.

Abstract

Recent years have seen significant progress in the general-purpose problem solving abilities of large vision and language models (LVLMs), such as ChatGPT and Gemini; some of these breakthroughs even seem to enable AI models to outperform human abilities in varied tasks that demand higher-order cognitive skills. Are the current large AI models indeed capable of generalized problem solving as humans do? A systematic analysis of AI capabilities for joint vision and text reasoning, however, is missing in the current scientific literature. In this paper, we make an effort towards filling this gap by evaluating state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads. Specifically, we consider problems from the Mathematical Kangaroo (MK) Olympiad, a popular international competition targeted at children from grades 1-12 that tests children's deeper mathematical abilities using puzzles appropriately gauged to their age and skills. Using the puzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840 problems from years 2020-2024. With our dataset, we analyze the power of LVLMs on mathematical reasoning; their responses on our puzzles offer a direct way to compare against those of children. Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children. Further analysis shows that there is no significant correlation between the reasoning capabilities of AI models and those of young children, and that the models' capabilities appear to be based on a different type of reasoning than the cumulative knowledge that underlies children's mathematics and logic skills.
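
To make the correlation claim concrete, the following is a minimal sketch of how per-problem agreement between children and a model could be measured. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not name the statistic used, so Pearson's r is assumed here, and the helper `pearson` and the two toy lists are hypothetical values, not data from SMART-840.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the paper): for five problems, the fraction
# of children who answered correctly, and whether a model answered correctly.
child_success_rate = [0.9, 0.7, 0.5, 0.3, 0.1]
model_correct = [0, 1, 1, 0, 1]

r = pearson(child_success_rate, model_correct)
print(f"Pearson r = {r:.3f}")
```

A value of r near zero or below zero on real data would mean that the problems children find easy are not the ones the model tends to solve, which is the pattern the abstract describes.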

Paper Structure (12 sections, 3 figures, 11 tables)

Figures (3)

  • Figure 1: A 3rd-grade puzzle from our SMART-840 dataset and the LVLM responses (both incorrect).
  • Figure 2: Statistics of MK participants and SMART-840 puzzles, in six panels: the distribution of children participating in MK Olympiads per year over 2020–2024 for grades 1–12; the total number of participants per grade during 2020–2024; the total number of participants each year over all grades (1–12); the number and share of puzzles in each category; the split between image-text and text-only puzzles; and the distribution of puzzle difficulty (defined by the puzzles' attributed weights).
  • Figure 3: Comparison of the average accuracy (%) of humans and LVLMs on each category of the Olympiad problems with the corresponding radar plot.