MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

Zimeng Huang, Jinxin Ke, Xiaoxuan Fan, Yufeng Yang, Yang Liu, Liu Zhonghan, Zedi Wang, Junteng Dai, Haoyi Jiang, Yuyu Zhou, Keze Wang, Ziliang Chen

TL;DR

MM-OPERA introduces a psychometric-inspired benchmark to evaluate open-ended association reasoning in large vision-language models through two tasks, Remote-Item Association and In-Context Association. It combines free-form responses with reasoning-path traces and two LLM-based judges (Regular and Process-Reward) to assess answer quality and the cognitive process, respectively. Experimental results show that state-of-the-art LVLMs lag behind humans in both outcome quality and reasoning depth, with larger gaps in ICA and notable divergences between surface-level plausibility and knowledge-grounded, domain-specific inferences. The work provides a scalable dataset and evaluation framework that highlights essential challenges in cross-domain, cross-cultural associative reasoning and offers a path toward more robust, human-like multimodal intelligence.

Abstract

Large Vision-Language Models (LVLMs) have made remarkable progress, yet they still fall short of human intelligence in ways such as hallucination and shallow pattern matching. In this work, we evaluate a fundamental yet underexplored facet of intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning, which is vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks, Remote-Item Association (RIA) and In-Context Association (ICA), that aligns the evaluation of association intelligence with human psychometric principles. It challenges LVLMs to exercise divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, and cultures, provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.

Paper Structure

This paper contains 42 sections, 32 figures, and 7 tables.

Figures (32)

  • Figure 1: An overview of MM-OPERA. The RIA task challenges models to discover meaningful connections between unrelated elements, while the ICA task requires transferring relationship patterns from a context pair to a query item to generate an appropriate target. The reference answer represents just one possible valid response. The association reasoning paths are used to evaluate the coherence and depth of the step-by-step reasoning process.
  • Figure 2: Statistics of MM-OPERA. (a) The hierarchical ability taxonomy consists of 3 levels, refining perceptual and conceptual associations. We report each ability's frequency as a percentage of total label occurrences to better represent the dataset's distribution. (b) Three relationship types capturing diverse associative connections. (c) The number of hops in the association reasoning path, quantifying the complexity of associative reasoning. (d) Different cultures, (e) 15 languages, and (f) 22 topic domains, ensuring broad cultural, linguistic, and thematic diversity.
  • Figure 3: Fine-grained reasoning capability analysis of nine multimodal language models on the RIA (left) and ICA (right) tasks. From top to bottom: reasoning score distribution, holistic score distribution, reasoning path hop count distribution, Reasonableness distribution, Distinctiveness distribution, and Knowledgeability distribution. Each task includes 500 sampled questions, and results are averaged over evaluations from the GPT-4o and DeepSeek-V3 judges.
  • Figure 4: Distribution of data sources.
  • Figure 5: Comparison of Model Performance in RIA and ICA across Different Conceptual (white background) and Perceptual (gray background) Dimensions. The radar charts illustrate the capabilities of various LVLMs in handling tasks related to relational perception, social insight, causal connections, abstract interpretation, and other cognitive functions. The left chart (RIA) exhibits greater variability in model performance, while the right chart (ICA) shows more consistent trends across models.
  • ...and 27 more figures
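
Two of the captions above describe simple aggregations: Figure 2(a) reports each ability's frequency as a percentage of total label occurrences, and Figure 3 averages evaluations from two judges. The following is a minimal sketch of how such aggregations could be computed, not the authors' code. The function names, the label strings, the judge keys, and the numeric score values are all hypothetical and chosen only for illustration; the actual label set and score scale are not given here.

```python
from collections import Counter
from statistics import mean


def ability_label_shares(instances):
    """Return each label's share (in percent) of all label occurrences.

    `instances` is a list of label lists; one instance may carry several
    ability labels, so the denominator is the total number of label
    occurrences, not the number of instances.
    """
    counts = Counter(label for labels in instances for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


def average_judge_scores(scores_by_judge):
    """Average per-question scores over all judges.

    `scores_by_judge` maps a judge name to a {question_id: score} dict.
    Only questions scored by every judge are kept (an assumption of this
    sketch; other policies are possible).
    """
    judges = list(scores_by_judge)
    if not judges:
        return {}
    shared = set(scores_by_judge[judges[0]])
    for judge in judges[1:]:
        shared &= set(scores_by_judge[judge])
    return {
        qid: mean(scores_by_judge[judge][qid] for judge in judges)
        for qid in sorted(shared)
    }


if __name__ == "__main__":
    # Hypothetical labels: three instances, five label occurrences in total.
    print(ability_label_shares([["perceptual", "conceptual"],
                                ["perceptual"],
                                ["perceptual", "causal"]]))
    # Hypothetical scores from two judges on two questions.
    print(average_judge_scores({
        "judge_a": {"q1": 3, "q2": 4},
        "judge_b": {"q1": 4, "q2": 2},
    }))
```

Counting label occurrences rather than instances keeps the percentages summing to 100 even when a single instance has several ability labels, which matches the wording of the Figure 2(a) caption.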