MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

Zimeng Huang; Jinxin Ke; Xiaoxuan Fan; Yufeng Yang; Yang Liu; Liu Zhonghan; Zedi Wang; Junteng Dai; Haoyi Jiang; Yuyu Zhou; Keze Wang; Ziliang Chen

TL;DR

MM-OPERA is a psychometrics-inspired benchmark that evaluates open-ended association reasoning in large vision-language models (LVLMs) through two tasks, Remote-Item Association (RIA) and In-Context Association (ICA). It combines free-form responses with reasoning-path traces and uses two LLM-based judges, Regular and Process-Reward, to assess answer quality and the underlying reasoning process, respectively. Experimental results show that state-of-the-art LVLMs lag behind humans in both outcome quality and reasoning depth, with larger gaps on ICA and notable divergences between surface-level plausibility and knowledge-grounded, domain-specific inferences. The work provides a scalable dataset and evaluation framework that highlights key challenges in cross-domain, cross-cultural associative reasoning and offers a path toward more robust, human-like multimodal intelligence.

Abstract

Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored form of intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks, Remote-Item Association (RIA) and In-Context Association (ICA), which aligns the evaluation of association intelligence with human psychometric principles. It challenges LVLMs to emulate the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, and other dimensions, provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.
