FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis
Jan Ondras, Marek Šuppa
TL;DR
FractalBench investigates whether multimodal large language models can infer recursive fractal generation rules from visual evidence. It evaluates 12 fractals defined via Iterated Function Systems, for a total of 7,320 samples. The framework isolates five visual-to-symbolic abstraction requirements (scale invariance, precise geometric transformations, recursive structure, compositional reasoning, and branching recursion) and uses a MinimalTurtle-based code-generation pipeline with a 95% IoU correctness criterion. Across four evaluated models and three prompting strategies, the study finds a large gap between syntactic success (76.1% of generated programs run) and semantic visual correctness (4.2% are correct). Performance drops sharply on branching recursion (trees: <2%), with moderate success on Koch-type geometric transformations (17-21%). The work introduces a contamination-resistant diagnostic for visual-mathematical reasoning. It shows that current systems largely capture local geometric operations but struggle to infer the underlying generative recursive rules, with implications for mathematical AI and future benchmark design.
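The 95% IoU criterion mentioned above can be illustrated with a minimal sketch. It assumes both the reference image and the rendered output have already been rasterized to binary masks on the same grid; the function names, the nested-list mask format, and the `>=` comparison are illustrative assumptions, not the benchmark's actual evaluation code.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # assumed format: rows of 0/1 pixel values


def foreground(mask: Mask) -> Set[Tuple[int, int]]:
    """Return the (row, col) coordinates of all non-zero pixels."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def iou(target: Mask, predicted: Mask) -> float:
    """Intersection over Union of the foreground pixels of two masks."""
    a, b = foreground(target), foreground(predicted)
    union = a | b
    if not union:
        # Both masks are empty: treat them as identical (an assumption).
        return 1.0
    return len(a & b) / len(union)


def is_correct(target: Mask, predicted: Mask, threshold: float = 0.95) -> bool:
    """Hypothetical correctness check using the 95% threshold from the text."""
    return iou(target, predicted) >= threshold


target = [[1, 1],
          [1, 0]]
predicted = [[1, 1],
             [0, 0]]

print(round(iou(target, predicted), 3))  # 0.667 (2 shared pixels, 3 in the union)
print(is_correct(target, predicted))     # False
```

A program that runs without error but draws the wrong shape fails this check, which is how a high run rate and a low correctness rate can coexist.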
Abstract
Mathematical reasoning requires abstracting symbolic rules from visual patterns, that is, inferring the infinite from the finite. We investigate whether multimodal AI systems possess this capability through FractalBench, a benchmark evaluating fractal program synthesis from images. Fractals provide ideal test cases: Iterated Function Systems with only a few contraction maps generate complex self-similar patterns through simple recursive rules, requiring models to bridge visual perception with mathematical abstraction. We evaluate four leading MLLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL) on 12 canonical fractals. Models must generate executable Python code reproducing the fractal, enabling objective evaluation. Results reveal a striking disconnect: 76% of generated programs are syntactically valid, but only 4% capture the mathematical structure. Success varies systematically: models handle geometric transformations (Koch curves: 17-21%) but fail at branching recursion (trees: <2%), revealing fundamental gaps in mathematical abstraction. FractalBench provides a contamination-resistant diagnostic for visual-mathematical reasoning and is available at https://github.com/NaiveNeuron/FractalBench
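To make concrete how a few contraction maps produce a self-similar pattern, the sketch below iterates a three-map Iterated Function System for a right-angled Sierpinski triangle. The choice of fractal, the vertex coordinates, and all function names are illustrative assumptions; this is not taken from the benchmark's code and does not use its MinimalTurtle interface.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
ContractionMap = Callable[[Point], Point]


def make_halving_map(vertex: Point) -> ContractionMap:
    """Build a contraction that moves a point halfway toward `vertex`."""
    vx, vy = vertex

    def apply(p: Point) -> Point:
        return ((p[0] + vx) / 2.0, (p[1] + vy) / 2.0)

    return apply


# Three maps, one per triangle corner (coordinates chosen for illustration).
SIERPINSKI_MAPS: List[ContractionMap] = [
    make_halving_map(v) for v in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
]


def iterate_ifs(maps: List[ContractionMap], seed: Point, depth: int) -> List[Point]:
    """Apply every map to every point, `depth` times, starting from `seed`."""
    points = [seed]
    for _ in range(depth):
        points = [f(p) for p in points for f in maps]
    return points


points = iterate_ifs(SIERPINSKI_MAPS, seed=(0.0, 0.0), depth=4)
print(len(points))  # 81, i.e. 3 maps applied over 4 levels
```

The whole pattern is determined by the three maps and the recursion depth, which is the kind of compact generative rule a model would need to recover from an image rather than copying individual strokes.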
