Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

Wayner Barrios, SouYoung Jin

Abstract

We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability, and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline in which four independent MLLMs generate trajectories, which are then aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures that are invisible to answer accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning in which no competitive model preserves more than 60% of matched steps in the correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves a 32% improvement in Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.
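The step-level scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy one-to-one matching, the helper names (`token_jaccard`, `match_steps`, `match_f1`, `multiplicative_reward`, `additive_reward`), and the token-overlap similarity used in place of a sentence encoder are all assumptions made for illustration. The threshold 0.35 is borrowed from the Figure 3 caption, where it applies to encoder similarities rather than to token overlap. The product form of the reward is only the simplest reading of "multiplicative"; the exact CPR definition is not given in this section.

```python
import re


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): a stand-in for sentence-encoder cosine similarity."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_steps(predicted, reference, tau=0.35, sim=token_jaccard):
    """Greedy one-to-one matching (assumption): pair steps whose similarity is >= tau."""
    candidates = sorted(
        ((sim(p, r), i, j)
         for i, p in enumerate(predicted)
         for j, r in enumerate(reference)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_pred, used_ref, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < tau:
            break  # remaining candidates are all below the threshold
        if i in used_pred or j in used_ref:
            continue  # each step may be matched at most once
        used_pred.add(i)
        used_ref.add(j)
        pairs.append((i, j))
    return sorted(pairs)


def match_f1(predicted, reference, tau=0.35, sim=token_jaccard):
    """Return (precision, recall, F1) over matched reasoning steps."""
    matched = len(match_steps(predicted, reference, tau, sim))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def multiplicative_reward(answer_correct: bool, step_f1: float) -> float:
    """Hypothetical product form: zero unless the answer is correct AND steps align."""
    return float(answer_correct) * step_f1


def additive_reward(answer_correct: bool, step_f1: float, weight: float = 0.5) -> float:
    """Hypothetical additive baseline: a correct answer earns reward even with no aligned steps."""
    return weight * float(answer_correct) + (1.0 - weight) * step_f1


reference = [
    "the image shows three consoles",
    "the left console is the largest",
    "the middle console is larger than the right console",
    "the right console is the smallest",
]
predicted = [
    "the image shows three consoles",
    "the right console is the smallest",
]

precision, recall, f1 = match_f1(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case both predicted steps match a reference step, but only two of the four reference steps are covered, so precision exceeds recall. That is the cherry-picking pattern the abstract describes. Under the hypothetical product reward, a correct answer with no aligned steps earns nothing, whereas the additive baseline still pays out for it.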

Paper Structure (39 sections, 12 equations, 40 figures, 8 tables)

Figures (40)

  • Figure 1: The lucky guess problem. LLaVA-v1.6-7B answers correctly (C) but contradicts itself by claiming the middle console is larger while selecting it as smallest. Previous benchmarks score 100%; CRYSTAL compares the model's predicted steps against reference reasoning steps via Match F1 (0.15), exposing flawed reasoning.
  • Figure 2: CRYSTAL spans diverse multimodal reasoning scenarios. Three representative examples from different source benchmarks: (Left) RealWorldQA tests spatial understanding; (Middle) MMVP requires fine-grained visual perception; (Right) ScienceQA demands multi-hop logical reasoning. Numbers in parentheses indicate the total number of reference reasoning steps per example.
  • Figure 3: Ablation study: Encoder and threshold comparison. We evaluate 4 sentence encoders across 5 thresholds, averaged over all models. (a) all-distilroberta-v1 consistently achieves the highest Match F1, with a 4.9 pp gain at $\tau=0.35$. (b--c) Higher thresholds decrease precision and recall for most encoders, while DistilRoBERTa remains stable across the full threshold range. We select $\tau=0.35$ as the optimal operating point.
  • Figure 4: Order metric comparison on a representative subset of 6 models. Both metrics rise for weak models generating few steps, but LIS ratio provides wider inter-group discrimination (0.56--0.85 vs. 0.58--0.81), clearly exposing trivially ordered few-step outputs.
  • Figure 5: Training dynamics. Left: Match F1. Composite oscillates; Answer-Only stays flat; CPR variants improve monotonically. Right: Accuracy. Composite collapses at step 600. Composite and Answer-Only training was halted at step 1,500 due to NaN gradient divergence; CPR variants train stably through 2,800 steps.
  • ...and 35 more figures
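
The LIS ratio named in the Figure 4 caption can be sketched as follows, assuming it is the length of the longest increasing subsequence of matched reference indices (read in predicted order) divided by the number of matched steps. The function names and the `(predicted_index, reference_index)` pair format are hypothetical. How this ratio enters Ordered Match F1 is not specified in this section, so the sketch stops at the ratio itself.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence, in O(n log n)."""
    tails = []  # tails[k] = smallest tail of any increasing subsequence of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def lis_ratio(pairs):
    """Fraction of matched steps that appear in reference order.

    `pairs` holds (predicted_index, reference_index) tuples (assumed format).
    """
    if not pairs:
        return 0.0
    ref_order = [ref_idx for _, ref_idx in sorted(pairs)]
    return lis_length(ref_order) / len(ref_order)


# Five matched steps; the prediction states reference step 2 before steps 0 and 1.
shuffled = [(0, 2), (1, 0), (2, 1), (3, 3), (4, 4)]
print(lis_ratio(shuffled))

# A single matched step is trivially ordered.
print(lis_ratio([(0, 3)]))
```

A single matched step always scores 1.0, which is consistent with the caption's remark that weak models producing few steps can look well ordered for trivial reasons.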