TurtleBench: A Visual Programming Benchmark in Turtle Geometry

Sina Rismanchian, Yasaman Razeghi, Sameer Singh, Shayan Doroudi

TL;DR

TurtleBench introduces a visual-to-code benchmark built on turtle geometry to evaluate large multimodal models (LMMs) on their ability to interpret geometric patterns and generate executable code. With 260 manually crafted tasks spanning Scratch and Tweak types and multiple input modalities, the benchmark uses an automatic OpenCV-based similarity pipeline to judge code outputs, reporting major gaps between human-like understanding and current LMM performance. Across state-of-the-art models, results show minimal gains from prompting enhancements and poor generalization to unseen command sets, underscoring fundamental challenges in integrating visual reasoning with programming. The work highlights educational and research implications, offering a benchmark to guide future progress in robust vision-to-code abilities and systematic evaluation of AI-assisted geometry education.

Abstract

Humans can reason about geometric patterns in images and scenes from a young age. However, developing large multimodal models (LMMs) capable of similar reasoning remains a challenge, highlighting the need for robust evaluation methods to assess these capabilities. We introduce TurtleBench, a benchmark designed to evaluate LMMs' capacity to interpret geometric patterns (given visual examples, textual instructions, or both) and generate precise code outputs. Inspired by turtle geometry, a notion used to teach children foundational coding and geometric concepts, TurtleBench features tasks with patterned shapes that have underlying algorithmic logic. Our evaluation reveals that leading LMMs struggle significantly with these tasks: GPT-4o achieves only 19% accuracy on the simplest tasks, and few-shot prompting improves performance only marginally (<2%). TurtleBench highlights the gap between human and AI performance in intuitive and visual geometrical understanding. It stands as one of the few benchmarks to evaluate the integration of visual understanding and code generation capabilities in LMMs, setting the stage for future research in this area. The code and dataset for this paper are available at https://github.com/sinaris76/TurtleBench
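
To make the task format concrete, the following is a minimal sketch of the idea described above: a turtle-style program draws a shape, and the drawing is compared against a target image. It is not the authors' evaluation pipeline, which is described as OpenCV-based. Here `MiniTurtle`, `draw_polygon`, and `jaccard` are hypothetical helpers, the turtle is simulated headlessly on an integer grid, and Jaccard overlap of drawn pixels stands in for the actual similarity measure.

```python
import math


class MiniTurtle:
    """Tiny headless stand-in for a turtle (hypothetical helper, not the paper's code)."""

    def __init__(self):
        self.x, self.y = 0.0, 0.0
        self.heading = 0.0   # degrees; 0 points east, counter-clockwise is positive
        self.pixels = set()  # integer grid cells touched by the pen

    def forward(self, distance):
        # Sample the segment densely and mark each rounded grid cell.
        steps = max(1, int(abs(distance) * 2))
        dx = math.cos(math.radians(self.heading)) * distance
        dy = math.sin(math.radians(self.heading)) * distance
        for i in range(steps + 1):
            t = i / steps
            self.pixels.add((round(self.x + dx * t), round(self.y + dy * t)))
        self.x += dx
        self.y += dy

    def left(self, angle):
        self.heading = (self.heading + angle) % 360

    def right(self, angle):
        self.left(-angle)


def draw_polygon(sides, length):
    """Draw a regular polygon and return the set of pixels it covers."""
    t = MiniTurtle()
    for _ in range(sides):
        t.forward(length)
        t.left(360 / sides)
    return t.pixels


def jaccard(a, b):
    """Overlap of two pixel sets: |a & b| / |a | b| (assumed stand-in metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


target = draw_polygon(4, 40)     # target shape: a square
candidate = draw_polygon(4, 40)  # a program that reproduces the square
wrong = draw_polygon(3, 40)      # a program that draws a triangle instead

print(jaccard(target, candidate), jaccard(target, wrong))
```

Under this sketch, a candidate program would count as correct only if its overlap with the target exceeds some threshold; the threshold and the exact metric used in the paper are not stated in this section.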

Paper Structure

This paper contains 32 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: An illustration of existing types and modes in TurtleBench.
  • Figure 2: An illustration of different modes of a single task in TurtleBench, along with the images generated by code from the outputs of GPT-4o and Gemini 1.5 Flash. More examples are provided in the appendix.
  • Figure 3: Basic prompt used in our experiments.
  • Figure 4: v-CoT prompt used in our experiments.
  • Figure 5: An example of a complete prompt for a Tweak code generation task using v-CoT prompting.
  • ...and 5 more figures