FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis

Jan Ondras, Marek Šuppa

TL;DR

FractalBench investigates whether multimodal large language models can infer recursive fractal generation rules from visual evidence. It evaluates 12 fractals defined via Iterated Function Systems on 7,320 samples. The framework isolates the requirements of visual-to-symbolic abstraction (scale invariance, precise geometric transformations, recursive structure, compositional reasoning, and branching recursion) and uses a MinimalTurtle-based code-generation pipeline with a 95% IoU correctness criterion. Across four evaluated models and three prompting strategies, the study finds a large gap between syntactic code execution (76.1% of generated programs run) and semantic visual correctness (4.2% are correct). Performance drops sharply on branching recursion (trees: <2%) and is moderate on Koch-type geometric transformations (17-21%). The work introduces a contamination-resistant diagnostic for visual-mathematical reasoning. It shows that current systems largely capture local geometric operations but struggle to infer the underlying recursive generative rules, with implications for mathematical AI and future benchmark design.
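To make the 95% IoU criterion concrete, the following is a minimal sketch of Intersection over Union on two binary pixel masks. It is not the benchmark's own evaluation code: the function names `iou` and `is_correct`, the list-of-rows mask format, the inclusive threshold, and the treatment of two empty masks as a perfect match are all assumptions made for illustration.

```python
def iou(mask_a, mask_b):
    """Intersection over Union of two equally shaped binary masks.

    Masks are lists of rows; any truthy cell counts as a drawn pixel.
    (Hypothetical format; the benchmark's rasterization may differ.)
    """
    same_shape = len(mask_a) == len(mask_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(mask_a, mask_b)
    )
    if not same_shape:
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0

    # Assumption: two empty masks are treated as identical.
    return intersection / union if union else 1.0


def is_correct(predicted, ground_truth, threshold=0.95):
    """Assumed decision rule: correct when IoU reaches the threshold."""
    return iou(predicted, ground_truth) >= threshold


truth = [[1, 1], [1, 0]]
pred = [[1, 1], [0, 0]]
score = iou(pred, truth)  # 2 shared pixels out of 3 drawn in either mask
```

Under such a rule, a program can run and still draw a shape that overlaps the target only partially, which is how an executable output can fall far short of the correctness bar.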

Abstract

Mathematical reasoning requires abstracting symbolic rules from visual patterns -- inferring the infinite from the finite. We investigate whether multimodal AI systems possess this capability through FractalBench, a benchmark evaluating fractal program synthesis from images. Fractals provide ideal test cases: Iterated Function Systems with only a few contraction maps generate complex self-similar patterns through simple recursive rules, requiring models to bridge visual perception with mathematical abstraction. We evaluate four leading MLLMs -- GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL -- on 12 canonical fractals. Models must generate executable Python code reproducing the fractal, enabling objective evaluation. Results reveal a striking disconnect: 76% generate syntactically valid code but only 4% capture mathematical structure. Success varies systematically -- models handle geometric transformations (Koch curves: 17-21%) but fail at branching recursion (trees: <2%), revealing fundamental gaps in mathematical abstraction. FractalBench provides a contamination-resistant diagnostic for visual-mathematical reasoning and is available at https://github.com/NaiveNeuron/FractalBench
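As an illustration of how a few contraction maps produce a self-similar pattern, here is a minimal sketch of an Iterated Function System for a Sierpiński-style triangle. It is not taken from the benchmark and does not use its MinimalTurtle interface. The function name `sierpinski_ifs`, the vertex coordinates, and the deterministic iteration scheme are assumptions chosen for brevity.

```python
def sierpinski_ifs(depth):
    """Apply three contraction maps `depth` times, starting from one point.

    Each map moves a point halfway toward one triangle vertex, so it
    contracts distances by a factor of 1/2. The vertices below are
    illustrative and do not form an equilateral triangle.
    """
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]

    # Bind each vertex as a default argument so every lambda keeps its own.
    maps = [
        lambda p, v=v: ((p[0] + v[0]) / 2, (p[1] + v[1]) / 2)
        for v in vertices
    ]

    points = {(0.0, 0.0)}
    for _ in range(depth):
        # One IFS step: the new set is the union of all mapped copies.
        points = {f(p) for p in points for f in maps}
    return points


points = sierpinski_ifs(3)
```

The whole rule is the list of three maps; the recursion depth only controls how much of the pattern becomes visible. Recovering a compact rule of this kind from a rendered image is the task the benchmark poses.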

Paper Structure

This paper contains 63 sections, 3 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Twelve canonical fractals testing different mathematical reasoning capabilities: linear recursion (Cantor), geometric transformations (Koch), multi-scale self-similarity (Sierpiński), space-filling curves (dragons), and branching recursion (trees). All defined via Iterated Function Systems.
  • Figure 2: Representative failure cases showing model-generated green fractals (left) versus ground truth (right). These six examples achieved low similarity scores, $\textrm{IoU} \in (0.010, 0.011)$, demonstrating cases where models produced visual output but failed to implement correct fractal structures.
  • Figure 3: Cantor Set: Synthesized code complexity (# non-blank, non-comment lines of code, averaged over colors) vs. recursion depth comparing all models for each prompting strategy.
  • Figure 4: Cantor Dust: Synthesized code complexity (# non-blank, non-comment lines of code, averaged over colors) vs. recursion depth comparing all models for each prompting strategy.
  • Figure 5: Koch Curve: Synthesized code complexity (# non-blank, non-comment lines of code, averaged over colors) vs. recursion depth comparing all models for each prompting strategy.
  • ...and 9 more figures