Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning

Deepanway Ghosal, Vernon Toh Yan Han, Chia Yew Ken, Soujanya Poria

TL;DR

This paper introduces AlgoPuzzleVQA, a large-scale multimodal puzzle dataset designed to probe vision-language-algorithmic reasoning through diverse, automatically generated puzzles with exact solutions. It evaluates multiple state-of-the-art multimodal LLMs under zero-shot and guided prompting regimes, revealing that performance is frequently near random, with the calendar-oriented tasks showing the strongest gains (e.g., up to ~57% in some setups). The authors perform ontological analyses and a guided-vision protocol to dissect where perception vs. reasoning bottlenecks lie, finding that visual features are easier for some models than abstract algorithmic reasoning, and that improvements from guided context are partial. Overall, the work highlights significant challenges in integrating visual understanding with complex algorithmic reasoning and provides a scalable dataset and methodology to push future research in multimodal reasoning systems.

Abstract

This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles to encompass a diverse array of mathematical and algorithmic topics such as boolean logic, combinatorics, graph theory, optimization, search, etc., aiming to evaluate the gap between visual data interpretation and algorithmic problem-solving skills. The dataset is generated automatically from code authored by humans. All our puzzles have exact solutions that can be found from the algorithm without tedious human calculations. This ensures that our dataset can be scaled up arbitrarily in terms of reasoning complexity and dataset size. Our investigation reveals that large language models (LLMs) such as GPT4V and Gemini exhibit limited performance in puzzle-solving tasks. We find that their performance is near random in a multi-choice question-answering setup for a significant number of puzzles. The findings emphasize the challenges of integrating visual, language, and algorithmic knowledge for solving complex reasoning problems.
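The idea that puzzles are generated from code and come with exact, algorithmically computed answers can be illustrated with a minimal sketch. The example below is not the authors' generator. It is a hypothetical toy version of a clock-style puzzle ("someone arrived N minutes ago; what time was it?"). The function names, the four-option multiple-choice format, and the way distractors are sampled are all assumptions made for illustration, and image rendering is omitted entirely.

```python
import random


def format_time(total_minutes):
    """Format minutes past 12:00 as H:MM on a 12-hour dial."""
    total_minutes %= 12 * 60
    hour = total_minutes // 60
    if hour == 0:
        hour = 12
    return f"{hour}:{total_minutes % 60:02d}"


def arrival_time(current_hour, current_minute, minutes_ago):
    """Exact answer: the clock reading `minutes_ago` minutes before now."""
    now = (current_hour % 12) * 60 + current_minute
    return format_time(now - minutes_ago)


def make_instance(rng, n_options=4):
    """Sample one puzzle instance with a gold answer and distractors.

    Hypothetical format: the number of options and the uniform
    sampling of distractors are assumptions, not the paper's scheme.
    """
    hour = rng.randint(1, 12)
    minute = rng.randint(0, 59)
    ago = rng.randint(1, 59)
    answer = arrival_time(hour, minute, ago)
    options = {answer}
    while len(options) < n_options:
        options.add(format_time(rng.randrange(12 * 60)))
    options = sorted(options)
    rng.shuffle(options)
    return {
        "question": f"Someone arrived {ago} minutes ago. "
                    "What time did they arrive?",
        "clock": (hour, minute),  # would be rendered as an image
        "options": options,
        "answer": answer,
    }


if __name__ == "__main__":
    print(make_instance(random.Random(0)))
```

Because the answer is computed rather than annotated, the instance count and difficulty (for example, the range of `ago`) can be varied freely, which is the scaling property the abstract describes. With four options, random guessing would score about 25% in this toy setup.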

Paper Structure

This paper contains 28 sections, 16 figures, and 2 tables.

Figures (16)

  • Figure 1: Question: The checkerboard shown in the image was originally of 6 * 9 in dimension having a total of 54 squares. It uses two colours of squares, one light yellow and one dark yellow, in a chequered pattern. Two of the squares have been removed from the board in the position of the white coloured cells, as shown in the image. You have 26 dominoes of size 2 * 1. You can use them as is or you can rotate them to use as a 1 * 2 domino. Is it possible to place all the 26 dominoes in the checkerboard to exactly cover all the remaining 52 squares? Answer Yes or No. Gold Answer: Yes
  • Figure 2: Question: The image shows the calendar of a month of a particular non-leap year. Which day of the week was on March 1 of that year? Gold Answer: Friday
  • Figure 3: Question: Alice has 12 segments of chains of different lengths as shown in the image. The total length of all the segments combined is 32 pieces. She has a saw machine with which a closed piece can be cut opened. She also has a welding machine with which an open piece can be closed. Each cut takes 5 minutes and each welding takes 2 minutes. Initially, she has 3 segments each with 1 open piece as shown in the image. All the other pieces are closed. She now wants to make the longest possible necklace using all the available 32 pieces. Each piece in the necklace would be connected to exactly two other pieces. This would require cutting open some pieces and then joining all the resulting segments together. What is the minimum time in which she can create the necklace? Gold Answer: 34
  • Figure 4: Question: Alexis came to an event 3 minutes ago. The current time is shown on the clock. The clock is a standard analog clock without the seconds hand. What was the time when Alexis came to the event? Gold Answer: 9:19
  • Figure 5: Question: A 5 * 4 board consists of 20 different coloured tiles. A random state of the board is shown in (A). The ideal state of the board is shown in (B). A swap consists of selecting any two tiles in the board and switching their positions. What is the minimum number of swaps required to restore the ideal state of the board from (A)? Gold Answer: 4
  • ...and 11 more figures
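
As an illustration of how a puzzle like the tile-swap question in Figure 5 admits an exact algorithmic answer, the sketch below computes the minimum number of swaps with the standard permutation-cycle argument: when any two tiles may be swapped and all tiles are distinct, a cycle of length k needs k - 1 swaps, so the minimum is the number of tiles minus the number of cycles. This is a generic textbook approach, not necessarily the solver used by the authors. The function name and the flat-list board encoding are assumptions.

```python
def min_swaps(current, target):
    """Minimum number of pairwise swaps turning `current` into `target`.

    Both arguments are flat lists of distinct tile labels
    (a hypothetical encoding of the board, read row by row).
    """
    if (len(current) != len(target)
            or len(set(current)) != len(current)
            or set(current) != set(target)):
        raise ValueError("boards must contain the same distinct tiles")

    # perm[i] is the position where the tile currently at i belongs.
    goal_index = {tile: i for i, tile in enumerate(target)}
    perm = [goal_index[tile] for tile in current]

    # Count cycles of the permutation, including fixed points.
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycles += 1
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]

    # Each cycle of length k costs k - 1 swaps.
    return len(perm) - cycles


if __name__ == "__main__":
    ideal = list(range(20))          # a 5 * 4 board with 20 distinct tiles
    shuffled = ideal.copy()
    shuffled[0], shuffled[1] = shuffled[1], shuffled[0]   # one 2-cycle
    shuffled[2], shuffled[3] = shuffled[3], shuffled[2]   # another 2-cycle
    shuffled[4], shuffled[5], shuffled[6] = 5, 6, 4       # one 3-cycle
    print(min_swaps(shuffled, ideal))
```

The example board is made up for demonstration and is not the board from Figure 5. It contains two 2-cycles and one 3-cycle, so it needs 1 + 1 + 2 swaps.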