Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning
Deepanway Ghosal, Vernon Toh Yan Han, Chia Yew Ken, Soujanya Poria
TL;DR
This paper introduces AlgoPuzzleVQA, a large-scale multimodal puzzle dataset designed to probe vision-language-algorithmic reasoning through diverse, automatically generated puzzles with exact solutions. It evaluates multiple state-of-the-art multimodal LLMs under zero-shot and guided prompting regimes, revealing that performance is frequently near random, with the calendar-oriented tasks showing the strongest gains (e.g., up to ~57% in some setups). The authors perform ontological analyses and a guided-vision protocol to dissect where perception vs. reasoning bottlenecks lie, finding that visual features are easier for some models than abstract algorithmic reasoning, and that improvements from guided context are partial. Overall, the work highlights significant challenges in integrating visual understanding with complex algorithmic reasoning and provides a scalable dataset and methodology to push future research in multimodal reasoning systems.
Abstract
This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that require visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles to encompass a diverse array of mathematical and algorithmic topics such as boolean logic, combinatorics, graph theory, optimization, and search, aiming to evaluate the gap between visual data interpretation and algorithmic problem-solving skills. The dataset is generated automatically from code authored by humans. All our puzzles have exact solutions that can be computed by the algorithm without tedious human calculations. This ensures that our dataset can be scaled up arbitrarily in terms of reasoning complexity and dataset size. Our investigation reveals that large language models (LLMs) such as GPT4V and Gemini exhibit limited performance in puzzle-solving tasks. We find that their performance is near random in a multi-choice question-answering setup for a significant number of puzzles. The findings emphasize the challenges of integrating visual, language, and algorithmic knowledge for solving complex reasoning problems.
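To make the idea of code-generated puzzles with exact solutions concrete, the following is a minimal sketch of how one multiple-choice instance might be produced and what "near random" means for it. It is not the authors' generator: the function names (`make_calendar_puzzle`, `random_baseline`), the date range, the question wording, and the choice of four options are all assumptions made for illustration, and the visual rendering that the actual dataset relies on is omitted entirely.

```python
import random
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def make_calendar_puzzle(rng, num_options=4):
    """Build one hypothetical multiple-choice calendar question.

    The answer is computed by code, so it is exact by construction.
    """
    # Sample a date from an arbitrary (assumed) 50-year window.
    start = date(2000, 1, 1)
    day = start + timedelta(days=rng.randrange(365 * 50))

    # date.weekday() returns 0 for Monday through 6 for Sunday.
    answer = WEEKDAYS[day.weekday()]

    # Draw distinct wrong options, then shuffle them with the answer.
    distractors = rng.sample([w for w in WEEKDAYS if w != answer],
                             num_options - 1)
    options = distractors + [answer]
    rng.shuffle(options)

    return {
        "question": f"Which day of the week is {day.isoformat()}?",
        "options": options,
        "answer": answer,
    }


def random_baseline(num_options):
    """Expected accuracy of uniform guessing over the options."""
    return 1.0 / num_options


rng = random.Random(0)  # fixed seed so the instance is reproducible
puzzle = make_calendar_puzzle(rng)
print(puzzle["question"], puzzle["options"])
print("chance accuracy:", random_baseline(len(puzzle["options"])))
```

Because the answer comes from the generating code rather than from manual annotation, scaling such a generator in size or difficulty only requires changing its parameters. With four options, uniform guessing yields 25% accuracy, which is the kind of reference point a "near random" result would be compared against under this assumed setup.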
