PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving

Zeyu Zhang, Zijian Chen, Zicheng Zhang, Yuze Sun, Yuan Tian, Ziheng Jia, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai

TL;DR

The paper tackles data contamination and stagnation in LMM evaluation by introducing a fully dynamic multimodal framework, OVPG, that automatically generates fresh, uniquely solvable VQA instances. It operationalizes OVPG through PuzzleBench, a dynamic benchmark with 11,840 samples across six puzzle tasks spanning visual recognition, logical reasoning, and context understanding. Extensive experiments across 14 LMMs demonstrate improvements in evaluation robustness via dynamic data while exposing substantial gaps in current models' fine-grained perception and reasoning. The framework also showcases the ability to refresh existing static benchmarks, enabling scalable, up-to-date assessment aligned with rapidly evolving LMM capabilities.

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive capabilities across a wide range of multimodal tasks, achieving ever-increasing performance on various evaluation benchmarks. However, existing benchmarks are typically static and often overlap with pre-training datasets, which fixes their complexity and introduces substantial data contamination. Meanwhile, manually annotated datasets are labor-intensive, time-consuming, and subject to human bias and inconsistency, which undermines reliability and reproducibility. To address these problems, we propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG), which aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks. Specifically, the OVPG pipeline consists of a raw material sampling module, a visual content generation module, and a puzzle rule design module, which together ensure that each evaluation instance is primitive, highly randomized, and uniquely solvable, enabling continual adaptation to the evolving capabilities of LMMs. Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples. It features six carefully designed puzzle tasks targeting three core LMM competencies: visual recognition, logical reasoning, and context understanding. Unlike static benchmarks that quickly become outdated, PuzzleBench enables ongoing dataset refreshing through OVPG and a rich set of open-ended puzzle designs, allowing seamless adaptation to the evolving capabilities of LMMs.
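To make the three-stage structure concrete, the following is a minimal toy sketch of a sample, generate, and rule-design loop with a unique-solvability check. It is not the authors' implementation and does not reproduce any PuzzleBench task: the "odd one out" puzzle, the `SYMBOLS` pool, the `PuzzleInstance` type, and all function names are hypothetical, and a grid of symbol names stands in for a rendered image.

```python
import random
from dataclasses import dataclass

# Hypothetical raw-material pool; a real pipeline would sample icons or images.
SYMBOLS = ["circle", "square", "triangle", "star", "heart"]


@dataclass
class PuzzleInstance:
    grid: list      # n x n list of symbol names (stand-in for a rendered image)
    question: str   # natural-language question posed to the model
    answer: tuple   # (row, col) of the single odd cell


def sample_raw_material(rng):
    """Stage 1 (assumed): draw two distinct symbols from the pool."""
    base, odd = rng.sample(SYMBOLS, 2)
    return base, odd


def generate_visual_content(rng, n, base, odd):
    """Stage 2 (assumed): fill an n x n grid with `base` and place one `odd` cell."""
    grid = [[base] * n for _ in range(n)]
    row, col = rng.randrange(n), rng.randrange(n)
    grid[row][col] = odd
    return grid, (row, col)


def find_solutions(grid, odd):
    """Enumerate every cell that satisfies the puzzle rule."""
    return [
        (r, c)
        for r, line in enumerate(grid)
        for c, cell in enumerate(line)
        if cell == odd
    ]


def design_puzzle(rng, n):
    """Stage 3 (assumed): attach a question and verify the answer is unique."""
    base, odd = sample_raw_material(rng)
    grid, answer = generate_visual_content(rng, n, base, odd)
    if find_solutions(grid, odd) != [answer]:
        raise ValueError("puzzle is not uniquely solvable")
    question = f"Which (row, col) cell of the {n}x{n} grid contains a {odd}?"
    return PuzzleInstance(grid, question, answer)


if __name__ == "__main__":
    puzzle = design_puzzle(random.Random(), n=4)
    for line in puzzle.grid:
        print(" ".join(line))
    print(puzzle.question)
```

In this sketch, every call with a fresh random state yields a new instance whose ground truth is known by construction, which is the property that lets a dynamic benchmark be regenerated rather than reused. The grid size `n` acts as a simple difficulty knob, and passing a seeded `random.Random` makes a particular instance reproducible.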

Paper Structure

This paper contains 24 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (a) Data contamination: existing benchmarks often overlap with training data, compromising evaluation reliability. (b) Static and manually annotated benchmarks: fixed data distributions and human bias limit adaptability and scalability. (c) Semi-dynamic approaches: editing existing data offers limited novelty due to the need to preserve original answer consistency.
  • Figure 2: Illustration of the proposed fully dynamic multimodal evaluation framework and the design pipeline of a specific puzzle task. (a) The Open-ended Visual Puzzle Generation (OVPG) framework. (b) The pipeline for generating Icon Connect VQA samples.
  • Figure 3: Accuracy of 4 LMMs across grid sizes [3, 9] on four grid-based puzzle tasks. All models exhibit a general decline in performance as the grid size increases. (a) InternVL2.5-8B, (b) InternVL2.5-78B, (c) GPT-4o, (d) Gemini-2.0-Flash.
  • Figure 4: Accuracy of the Qwen series across four datasets: Jigsaw-Origin-1k, Jigsaw-AIGI-5k, Difference Hunt-Origin-1k, and Difference Hunt-AIGI-5k.
  • Figure 5:
  • ...and 2 more figures