
Tangram: Benchmark for Evaluating Geometric Element Recognition in Large Multimodal Models

Chao Zhang, Jiamin Tang, Jing Xiao

TL;DR

Tangram introduces a focused benchmark for geometric element recognition in large multimodal models, assembling 1,080 diagrams with 4,320 counting-based questions across three difficulty levels. The study systematically evaluates 13 LMMs and finds that the best model reaches only 53.0% accuracy, well below human performance, which underscores limitations in basic geometric perception. Tangram’s fine-grained annotations, uncontaminated data, and analysis of element types, diagram types, and prompting strategies offer a clear target for advancing multimodal visual understanding and geometric reasoning. The authors state that the data and code will be released to support the development of next-generation multimodal foundation models capable of reliable geometric diagram interpretation.

Abstract

Significant advancements in Large Multimodal Models (LMMs) have enabled them to tackle complex problems involving visual-mathematical reasoning. However, their ability to identify geometric elements remains underexplored. To address this gap, we introduce Tangram, a novel benchmark designed to evaluate the performance of LMMs on geometric element recognition. Tangram comprises 1,080 diverse geometric diagrams sourced from primary and secondary school exams, competitions, and textbooks, ranging from simple geometric shapes to complex combinations. Each diagram is paired with four questions, resulting in 4,320 visual-question-answer pairs. Unlike existing benchmarks that emphasize higher-level cognition and reasoning, Tangram focuses on understanding geometric elements, requiring models to perform a "simple yet challenging" counting task. Systematic evaluation of 13 prominent LMMs, such as GPT-4o and Claude 3.5 Sonnet, reveals that these models face significant challenges even in seemingly straightforward tasks. The top-performing model achieves an accuracy of only 53.0%, highlighting a substantial gap compared to human performance. These findings underscore the limitations of current multimodal AI systems in handling basic perception tasks and serve to inspire the development of the next generation of expert-level multimodal foundational models. The data and code will be released soon.
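
The counting setup described in the abstract can be illustrated with a small sketch. It checks the dataset arithmetic (1,080 diagrams with four questions each) and scores predictions by exact match on integer counts. The exact-match metric, the `CountingItem` record layout, and the function names are assumptions made for illustration; the abstract does not specify the scoring protocol, and the official code had not been released when it was written.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Element types taken from the Figure 2 caption of the paper.
ELEMENT_TYPES = ("letters", "circles", "triangles", "line segments")


@dataclass
class CountingItem:
    """Hypothetical record for one visual-question-answer pair."""
    diagram_id: str
    element_type: str
    gold_count: int
    # None means no integer could be extracted from the model's reply.
    predicted_count: Optional[int]


def total_questions(num_diagrams: int, questions_per_diagram: int = 4) -> int:
    """Number of question-answer pairs for a given number of diagrams."""
    return num_diagrams * questions_per_diagram


def accuracy(items: Sequence[CountingItem]) -> float:
    """Exact-match accuracy in percent (assumed metric, not the official one)."""
    if not items:
        return 0.0
    correct = sum(1 for it in items if it.predicted_count == it.gold_count)
    return 100.0 * correct / len(items)


# Toy data for a single diagram; the counts are made up for illustration.
toy_items = [
    CountingItem("d0001", "letters", 5, 5),
    CountingItem("d0001", "circles", 1, 1),
    CountingItem("d0001", "triangles", 4, 3),
    CountingItem("d0001", "line segments", 8, None),
]
```

With these definitions, `total_questions(1080)` reproduces the 4,320 pairs quoted in the abstract, and `accuracy(toy_items)` scores the toy diagram at 50.0 because two of its four counts match. Treating an unparseable reply as incorrect is one possible design choice; a real evaluation would need to document how free-form model output is mapped to an integer.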

Paper Structure (30 sections, 8 figures, 8 tables)

This paper contains 30 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Toy example of testing GPT-4o's accuracy in recognizing geometric elements in the given diagram, with correct answers highlighted in green and errors highlighted in red.
  • Figure 2: An example from our proposed Tangram. Each diagram is paired with four questions that involve counting geometric elements, including letters, circles, triangles and line segments.
  • Figure 3: Examples of geometric diagrams in Tangram categorized by difficulty, showing the number and types of elements in each category.
  • Figure 4: Accuracy (%) comparison between closed-source and open-source LMMs.
  • Figure 5: Performance of models on different types of geometric elements.
  • ...and 3 more figures