TurtleBench: A Visual Programming Benchmark in Turtle Geometry
Sina Rismanchian, Yasaman Razeghi, Sameer Singh, Shayan Doroudi
TL;DR
TurtleBench introduces a visual-to-code benchmark built on turtle geometry to evaluate large multimodal models (LMMs) on their ability to interpret geometric patterns and generate executable code. With 260 manually crafted tasks spanning Scratch and Tweak types and multiple input modalities, the benchmark uses an automatic OpenCV-based similarity pipeline to judge code outputs, reporting major gaps between human-like understanding and current LMM performance. Across state-of-the-art models, results show minimal gains from prompting enhancements and poor generalization to unseen command sets, underscoring fundamental challenges in integrating visual reasoning with programming. The work highlights educational and research implications, offering a benchmark to guide future progress in robust vision-to-code abilities and systematic evaluation of AI-assisted geometry education.
Abstract
Humans have the ability to reason about geometric patterns in images and scenes from a young age. However, developing large multimodal models (LMMs) capable of similar reasoning remains a challenge, highlighting the need for robust evaluation methods to assess these capabilities. We introduce TurtleBench, a benchmark designed to evaluate LMMs' capacity to interpret geometric patterns, given visual examples, textual instructions, or both, and generate precise code outputs. Inspired by turtle geometry, a notion used to teach children foundational coding and geometric concepts, TurtleBench features tasks with patterned shapes that have underlying algorithmic logic. Our evaluation reveals that leading LMMs struggle significantly with these tasks, with GPT-4o achieving only 19% accuracy on the simplest tasks and few-shot prompting improving performance only marginally (<2%). TurtleBench highlights the gap between human and AI performance in intuitive and visual geometrical understanding. It is one of the few benchmarks to evaluate the integration of visual understanding and code generation capabilities in LMMs, setting the stage for future research in this area. The code and dataset for this paper are available at https://github.com/sinaris76/TurtleBench
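To make the idea of a patterned shape with underlying algorithmic logic concrete, the following is a minimal sketch of turtle geometry in Python. It is an illustration only and is not taken from the benchmark: the command names (forward, left, right) and the helpers run_turtle and regular_polygon are assumptions introduced here, and the turtle is simulated numerically rather than drawn on screen.

```python
import math


def run_turtle(commands):
    """Trace (name, value) turtle commands and return the visited points.

    Hypothetical helper for illustration; not the benchmark's own format.
    The turtle starts at the origin facing east (heading 0 degrees).
    """
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for name, value in commands:
        if name == "forward":
            # Move along the current heading and record the new position.
            x += value * math.cos(math.radians(heading))
            y += value * math.sin(math.radians(heading))
            points.append((round(x, 6), round(y, 6)))
        elif name == "left":
            heading = (heading + value) % 360
        elif name == "right":
            heading = (heading - value) % 360
        else:
            raise ValueError(f"unknown command: {name}")
    return points


def regular_polygon(sides, length):
    """Build the command list for a regular polygon: one loop, one turn angle."""
    return [("forward", length), ("left", 360 / sides)] * sides


# A square is the pattern "forward, turn 90 degrees" repeated four times.
square = run_turtle(regular_polygon(4, 100))
print(square)
```

The point of the sketch is that the whole shape follows from a short repeated rule, which is the kind of structure a model has to recover from an image or a description before it can write the corresponding code.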
