Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling

Shengwu Xiong, Tianyu Zou, Cong Wang, Xuelong Li

TL;DR

The paper tackles the lack of interpretable, theoretically grounded benchmarks for Multimodal Large Language Models (MLLMs) by introducing a framework aligned with structural equation modeling (SEM) and a Piaget-inspired three-layer capability hierarchy. It then delivers the GOLD benchmark, constructed via SEM-driven task selection and a hierarchical Perception–Memory–Reasoning structure, with formative measurement linking observed task scores to latent abilities. Using metrics that include Dimensional Diversity $D_{div}$, Task Contribution $TC$, and Indicator Validity $D_{valid}$, the authors report improved latent separability, reduced redundancy, and stronger alignment with human judgments (Pearson $r = 0.7359$) compared with existing benchmarks. While acknowledging ceiling effects on state-of-the-art models, the work sets out a principled path toward scalable, interpretable benchmarking and offers concrete plans for enriching the GOLD benchmark so that it keeps its diagnostic power as models advance.
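For readers unfamiliar with the alignment figure quoted above, the following is a minimal sketch of how a Pearson correlation between per-model benchmark scores and human preference ratings could be computed. The function name `pearson_r` and all numbers are illustrative assumptions; they are not the paper's data, and this is not necessarily the authors' exact evaluation protocol.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only; NOT results from the paper.
benchmark_scores = [62.0, 70.5, 55.0, 81.0, 74.5]  # one score per model
human_ratings = [3.1, 3.8, 2.9, 4.5, 3.9]          # one rating per model

r = pearson_r(benchmark_scores, human_ratings)
print(f"Pearson r = {r:.4f}")
```

A value of $r$ near 1 would indicate that models ranked highly by the benchmark also tend to be rated highly by humans; a value near 0 would indicate little linear relationship.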

Abstract

Evaluating multimodal large language models (MLLMs) is fundamentally challenged by the absence of structured, interpretable, and theoretically grounded benchmarks; current heuristically-grouped tasks have vague cognitive targets, overlapping abilities, redundant indicators, and weak diagnostic power. We therefore propose a structural-equation-modeling-aligned framework that quantifies internal validity, dimensional separability, and component contributions, and introduce a Piaget-inspired capability hierarchy that stratifies MLLM abilities into Perception, Memory, and Reasoning. Reorganizing existing tasks under this theory, we build the GOLD benchmark, whose experiments show superior interpretability, lower indicator redundancy, and clearer cognitive consistency than prior benchmarks.

Paper Structure

This paper contains 23 sections, 14 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: A brief illustration of how to evaluate a benchmark.
  • Figure 2: Architecture of the SEM-based evaluation framework.
  • Figure 3: Structural analysis of MMBench using the SEM-based framework. MMBench shows stronger task contributions but suffers from redundancy.
  • Figure 4: Structural analysis of BLINK using the SEM-based framework. BLINK has lower redundancy but weaker task contributions.
  • Figure 5: HTMT heatmap of MMBench latent capabilities.
  • ...and 4 more figures