VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, Filippos Kokkinos

TL;DR

VGRP-Bench targets a critical gap in evaluating LVLMs on structured visual puzzles. It offers a large, customizable grid-based puzzle benchmark with 20 puzzles across multiple difficulty levels and a taxonomy of rule capabilities, enabling fine-grained assessment of perception, rule adherence, and reasoning. The authors study both off-the-shelf LVLMs and reasoning-focused models, and propose two post-training strategies, Solution SFT (S-SFT) and Reasoning SFT (R-SFT), to improve puzzle solving, while also examining generalization to unseen puzzles. Results show that current LVLMs struggle substantially: post-training improves performance on trained puzzles but generalizes poorly to unseen ones, underscoring the need for further research into robust, real-world problem-solving abilities. VGRP-Bench is released publicly to catalyze progress in multimodal reasoning for complex tasks.

Abstract

Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning, an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o), but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs' puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving. Project page: https://yufan-ren.com/subpage/VGRP-Bench/.

Paper Structure

This paper contains 24 sections, 4 equations, 30 figures, 9 tables.

Figures (30)

  • Figure 1: Benchmark Overview. (a) We present a benchmark for Large Vision-Language Models (LVLMs) consisting of 20 diverse visual grid reasoning puzzles (see the supplementary material for a complete table of per-puzzle examples and descriptions). (b) We evaluate state-of-the-art LVLMs, including closed-source models such as GPT-4o [chatgpt_4o] and Gemini [team2023gemini], open-source models like Llama 3.2 [dubey2024llama], and recently released reasoning models such as Gemini-Thinking, on various aspects, including perception, overall puzzle-solving, and cell-level rule-following. Additionally, to explore potential approaches for improving LVLMs’ puzzle-solving abilities, we examine post-training techniques, including (c) Solution Supervised Fine-Tuning (S-SFT) and (d) Reasoning Supervised Fine-Tuning (R-SFT), where we train on thought trajectories of a predefined solver. (Best viewed on a screen when zoomed in)
  • Figure 2: Result Summary on Easy 2 Level. Puzzle-solving rate of state-of-the-art chat LVLMs on easy-level puzzles associated with each rule. Please refer to the experiment section for detailed result analysis. Note that this plot's score ranges from 0 to 45%, instead of 100%. (Best viewed on a screen when zoomed in)
  • Figure 3: Benchmark Games: Primitives and Sample Questions. We systematically define puzzle primitives, including conditions, constraints, variables, and states, to establish a unified framework for inference and evaluation (left). This benchmark includes tasks designed to evaluate the reasoning, rule-following, and perception capabilities of state-of-the-art LVLMs. (Best viewed on a screen when zoomed in)
  • Figure 4: Diverse Rules and Visual Patterns in VGRP-Bench. Our benchmark includes a diverse set of rules, such as counting and mathematical calculations, and also exhibits diversity in visual patterns, encompassing text, numerical values, and objects such as trees. We highlight puzzles that are easy or difficult to convert into text.
  • Figure 5: Off-the-Shelf LVLMs on Level-Easy 2 with CoT. We report both the correct perception rate and the puzzle-solving rate for closed-source / open-source and reasoning / chat models. Please refer to the supplementary material for additional evaluations, such as finer-granularity evaluations and other difficulty levels, e.g., medium0 and hard-2. (Puzzle-solving rates are shown as hatched bars. Best viewed on a screen when zoomed in)
  • ...and 25 more figures
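
The captions of Figures 1 and 3 mention puzzle primitives (variables, states, constraints) and two evaluation granularities (overall puzzle-solving and cell-level rule-following). The following is a minimal Python sketch of how such primitives and scores could be organized. It is not the benchmark's actual code: the names `Puzzle`, `row_col_unique`, `cell_level_score`, and `solving_rate`, the Latin-square-style rule, and the exact scoring formulas are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A state is a grid of variables; None marks an unassigned (empty) cell.
Grid = List[List[Optional[int]]]
Constraint = Callable[[Grid, int, int], bool]


def row_col_unique(grid: Grid, r: int, c: int) -> bool:
    """Hypothetical rule: the value at (r, c) is not repeated in its row or column."""
    v = grid[r][c]
    if v is None:
        return True
    row_ok = all(grid[r][j] != v for j in range(len(grid[r])) if j != c)
    col_ok = all(grid[i][c] != v for i in range(len(grid)) if i != r)
    return row_ok and col_ok


@dataclass
class Puzzle:
    initial: Grid                  # given clues (the initial state)
    constraints: List[Constraint]  # rules every filled cell must satisfy

    def cell_ok(self, grid: Grid, r: int, c: int) -> bool:
        # A cell counts as correct if it is filled, keeps any given clue,
        # and satisfies every constraint (assumed definition).
        given = self.initial[r][c]
        if grid[r][c] is None or (given is not None and grid[r][c] != given):
            return False
        return all(rule(grid, r, c) for rule in self.constraints)

    def cell_level_score(self, grid: Grid) -> float:
        # Fraction of cells that pass cell_ok (assumed cell-level metric).
        cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[r]))]
        return sum(self.cell_ok(grid, r, c) for r, c in cells) / len(cells)

    def solved(self, grid: Grid) -> bool:
        # A puzzle is solved only if every cell is correct.
        return self.cell_level_score(grid) == 1.0


def solving_rate(puzzle: Puzzle, answers: List[Grid]) -> float:
    """Fraction of proposed answers that fully solve the puzzle."""
    return sum(puzzle.solved(a) for a in answers) / len(answers)


# Toy 2x2 instance with one clue in the top-left corner.
puzzle = Puzzle(initial=[[1, None], [None, None]], constraints=[row_col_unique])
good = [[1, 2], [2, 1]]
bad = [[1, 2], [2, 2]]
```

In this toy instance, `good` satisfies every cell, while `bad` violates the rule in three of its four cells. This shows why a cell-level score can be more informative than a binary solved/unsolved outcome when answers are only partly correct.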