STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction

Xiaoxiao Wang, Chunxiao Li, Junying Wang, Yijin Guo, Zijian Chen, Chunyi Li, Xiaohong Liu, Zicheng Zhang, Guangtao Zhai

TL;DR

STAR tackles the evaluation bottleneck for large models by combining retrieval-augmented statistical expectations with agentic reasoning guided by Expectation Violation Theory (EVT), predicting benchmark performance under extreme sparsity and pattern shifts. The framework embeds semantic knowledge from retrieval into Constrained Probabilistic Matrix Factorization (CPMF), producing a statistical expectation $\hat{R}_{mn}$ with uncertainty, then refines it via intra-family and cross-model analyses and credibility-weighted adjustments to yield a final prediction $\tilde{R}_{mn}$ with explanations. Empirical results on 285×28 OpenCompass-derived data show STAR achieving the best total score, with a 14.46% gain over the strongest statistical baseline under extreme sparsity and substantial gains under benchmark-side pattern shifts, while providing traceable reasoning. The work demonstrates practical, scalable, and interpretable model evaluation that reduces the need for costly full benchmark runs and supports rapid, credible decision-making.

Abstract

As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts and data sparsity and offer no explanations, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra-family analysis, cross-model comparison, and credibility-aware aggregation, producing adjustments with traceable explanations. Extensive experiments show that STAR consistently outperforms all baselines on both score-based and rank-based metrics, delivering a 14.46% gain in total score over the strongest statistical method under extreme sparsity, with only 1 to 2 observed scores per test model.
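The refinement step described above can be illustrated with a minimal sketch. It assumes the final prediction is the statistical expectation plus a credibility-weighted average of proposed adjustments, so that $\tilde{R}_{mn} = \hat{R}_{mn} + \Delta$. The `Evidence` class, the `refine` function, the weighting rule, and the cap on the shift are all hypothetical choices made for illustration; the abstract does not specify how STAR aggregates evidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    """One piece of reasoning evidence (hypothetical structure)."""
    source: str         # e.g. "intra-family" or "cross-model"
    adjustment: float   # proposed shift to the expectation, in score points
    credibility: float  # trust weight in [0, 1]


def refine(expectation: float, uncertainty: float, evidence: List[Evidence]) -> float:
    """Return expectation + delta, where delta is a credibility-weighted mean.

    Assumption (not from the paper): the shift is capped at +/- uncertainty,
    so a confident statistical expectation is moved less.
    """
    total = sum(e.credibility for e in evidence)
    if total == 0:
        return expectation  # no credible evidence: keep the statistical expectation
    delta = sum(e.credibility * e.adjustment for e in evidence) / total
    delta = max(-uncertainty, min(uncertainty, delta))
    return expectation + delta


evidence = [
    Evidence("intra-family", adjustment=4.0, credibility=0.75),
    Evidence("cross-model", adjustment=-2.0, credibility=0.25),
]
print(refine(70.0, 5.0, evidence))  # 72.5
```

Here the intra-family evidence dominates because it carries three times the credibility of the cross-model evidence, so the expectation of 70.0 moves up by 2.5 points.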

Paper Structure

This paper contains 46 sections, 18 equations, 15 figures, and 18 tables.

Figures (15)

  • Figure 1: When predicting missing benchmark scores for new models, STAR combines statistical expectation with knowledge-driven reasoning to produce accurate and explainable predictions.
  • Figure 2: Overview of the STAR framework. Given a user query to predict model $M$'s performance on benchmark $B$, STAR integrates four modules: (a) Historical Memory maintains the observed score matrix and structured profiles for models and benchmarks; (b) Retrieval Augmentation collects external knowledge from Google, Hugging Face, and arXiv; (c) Statistical Expectation combines PMF latent factors with retrieved semantic features to generate initial expectations with uncertainty estimates; (d) EVT-guided Reasoning refines predictions through two-step analysis and credibility-aware aggregation, producing adjustments $\Delta$ with explanations.
  • Figure 3: Performance comparison across masking rates from 40% to 95%. Left: Total Score (Rank Avg $-$ Score Avg, higher is better). Middle: Rank-based Average (higher is better). Right: Score-based Average (lower is better). STAR consistently achieves the best performance across all metrics, with its advantage over statistical baselines widening as sparsity increases. Details are in the appendix on sparsity results.
  • Figure 4: Illustration of evaluation settings. Left: Temporal split for sparsity experiments. Models released before January 2025 have fully observed scores (dark blue), while $P$% of scores for newer models are masked as prediction targets (light blue). Right: Pattern shift evaluation. Held-out models (bottom row, green) or benchmarks (rightmost column, cyan) are completely unseen during training, testing generalization to novel model characteristics or task categories.
  • Figure 5: Comparison of reasoning processes for predicting InternVL3-78B on MMBench_TEST_EN. Key evidence is highlighted in blue for supporting factors and red for limiting factors.
  • ...and 10 more figures
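
The sparsity setting and the Total Score described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration, assuming scores are stored as nested dictionaries and that the masked entries are drawn uniformly at random per model. The function names, the `is_new` flag (standing in for the January 2025 release cutoff), and the rounding rule are hypothetical; the captions state only that $P$% of the newer models' scores are masked and that Total Score is Rank Avg minus Score Avg.

```python
import random
from typing import Dict, Tuple

Scores = Dict[str, Dict[str, float]]  # model -> benchmark -> score


def mask_new_model_scores(
    scores: Scores, is_new: Dict[str, bool], mask_rate: float, seed: int = 0
) -> Tuple[Scores, Scores]:
    """Split scores into observed entries and masked prediction targets.

    Older models stay fully observed; for each newer model, a fraction
    `mask_rate` of its benchmarks is hidden (assumed uniform sampling).
    """
    rng = random.Random(seed)
    observed: Scores = {}
    targets: Scores = {}
    for model, row in scores.items():
        if not is_new[model]:
            observed[model] = dict(row)  # fully observed training row
            continue
        benchmarks = sorted(row)  # sorted for reproducible sampling
        n_mask = round(mask_rate * len(benchmarks))
        masked = set(rng.sample(benchmarks, n_mask))
        observed[model] = {b: s for b, s in row.items() if b not in masked}
        targets[model] = {b: s for b, s in row.items() if b in masked}
    return observed, targets


def total_score(rank_avg: float, score_avg: float) -> float:
    """Total Score as in the Figure 3 caption: Rank Avg - Score Avg (higher is better)."""
    return rank_avg - score_avg


# Toy data: two models with 28 benchmarks each, 95% masking for the new model.
toy = {m: {f"bench_{i}": 50.0 + i for i in range(28)} for m in ("old_model", "new_model")}
observed, targets = mask_new_model_scores(
    toy, {"old_model": False, "new_model": True}, mask_rate=0.95
)
print(len(observed["new_model"]), len(targets["new_model"]))  # 1 27
```

In this toy run, a 95% masking rate over 28 benchmarks leaves a single observed score for the new model.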