Forgotten Polygons: Multimodal Large Language Models are Shape-Blind

William Rudman, Michal Golovanevsky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh

TL;DR

Forgotten Polygons investigates why Multimodal LLMs (MLLMs) struggle with visual-math reasoning, especially counting the sides of polygons. The authors split the problem into vision and language components and find that vision encoders are shape-blind, while the LLMs rely on memorized associations rather than geometric reasoning. They introduce Visually Cued Chain-of-Thought (VC-CoT) prompting, which directs models to reference visual annotations in the diagram and raises GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%, offering a path toward engaging System 2 reasoning in MLLMs. The work argues that visual grounding and targeted prompting are needed to bridge perception and reasoning in multimodal models, and it leaves open how well the approach generalizes to real-world data and broader geometric tasks.

Abstract

Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting that they have not learned the concept of sides and do not effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.
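
As a rough illustration of the idea behind VC-CoT, the sketch below builds a prompt that asks a model to enumerate the annotations drawn on a diagram before giving a count, and then scores a free-text response against the known number of annotations. The function names (`build_vc_cot_prompt`, `extract_count`, `is_correct`), the prompt wording, and the assumption that each side carries exactly one annotation are hypothetical and not taken from the paper; the authors' actual prompts are in the linked repository.

```python
import re
from typing import Optional, Sequence


def build_vc_cot_prompt(annotation_kind: str = "letters") -> str:
    """Build a prompt that points the model at annotations in the image.

    Assumption for this sketch: every side of the shape is marked with one
    annotation of the given kind (for example, a letter).
    """
    steps = [
        f"The sides of the shape in the image are marked with {annotation_kind}.",
        f"Step 1: List all the {annotation_kind} you can see on the shape.",
        f"Step 2: Count the {annotation_kind} you listed.",
        "Step 3: State that count as the number of sides.",
    ]
    return "\n".join(steps)


def extract_count(response: str) -> Optional[int]:
    """Return the last integer in a model response, or None if there is none."""
    numbers = re.findall(r"\d+", response)
    return int(numbers[-1]) if numbers else None


def is_correct(response: str, labels: Sequence[str]) -> bool:
    """Compare the model's final count with the number of annotations drawn."""
    return extract_count(response) == len(labels)
```

Under these assumptions, the response "I see A, B, C, D, E. Answer: 5" would be scored as correct for a shape annotated with five labels. Taking the last integer in the response is a simplification that would need a stricter answer format in a real evaluation.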

Paper Structure

This paper contains 24 sections, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Multimodal Large Language Models (MLLMs), such as GPT-4o, Janus-Pro, and Molmo, fail at counting the number of sides of novel shapes.
  • Figure 2: t-SNE plots of vision encoder embeddings from LLaVA-OneVision. Only triangles and squares form distinct clusters. The appendix on vision encoder experiments shows that all models learn a similar embedding.
  • Figure 3: Illustration of failure modes in the two-shape reasoning task. A: Successful completion of all steps. B: The most common failure mode, where misidentification in Step 1 leads to an incorrect sum in Step 3. C: An error in mapping shapes to their number of sides (Step 2), affecting the final sum.
  • Figure 4: Examples of abstract shapes. The full set of shapes is shown in the appendix.
  • Figure 5: Example outputs from GPT-4-Turbo on "random letters" annotations.
  • ...and 5 more figures
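
The three steps named in the Figure 3 caption (identify each shape, map it to its number of sides, sum the counts) can be sketched as a small diagnostic that reports the first step at which an answer goes wrong. The lookup table `SIDES`, the function `diagnose`, and the failure labels are illustrative assumptions for this sketch, not the authors' evaluation code.

```python
from typing import Sequence

# Number of sides for a few regular polygons (illustrative subset).
SIDES = {
    "triangle": 3,
    "square": 4,
    "pentagon": 5,
    "hexagon": 6,
    "heptagon": 7,
    "octagon": 8,
}


def diagnose(
    true_shapes: Sequence[str],
    predicted_shapes: Sequence[str],
    predicted_sides: Sequence[int],
    predicted_sum: int,
) -> str:
    """Return the first step at which a two-shape answer fails."""
    true_sides = [SIDES[shape] for shape in true_shapes]

    # Step 1: were the shapes identified correctly?
    if list(predicted_shapes) != list(true_shapes):
        return "step 1: shape misidentified"
    # Step 2: was each shape mapped to the right number of sides?
    if list(predicted_sides) != true_sides:
        return "step 2: wrong side count for a shape"
    # Step 3: was the final sum computed correctly?
    if predicted_sum != sum(true_sides):
        return "step 3: wrong sum"
    return "correct"
```

For example, a model that reads a heptagon as a hexagon and then sums 6 + 8 = 14 is reported as a Step 1 failure, even though its arithmetic is internally consistent, which mirrors failure mode B in the caption.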