SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu

TL;DR

SKYLENAGE develops two hard, metadata-rich benchmarks to evaluate mathematical reasoning across structured and contest-style tasks. It introduces SKYLENAGE-ReasoningMATH for structure-first reasoning with per-item features and SKYLENAGE-MATH for grade-spanning contest problems, evaluated under a unified protocol across 15 models. The results reveal stable leader–mid–tail separations, fragmentation by subject, and strong alignment between long-form reasoning and contest performance, with top scores of 81% on ReasoningMATH and 44% on SKYLENAGE-MATH. The work provides a robust reference for future research, enabling fine-grained diagnostics and prospective ensembles to improve mathematical reasoning in LLMs.

Abstract

Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered, broad-coverage math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.
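The abstract does not spell out how the doctoral-to-high-school retention is computed. A minimal sketch, assuming it is the ratio of doctoral-stage accuracy to high-school-stage accuracy for the same model; the function name `retention` and the input accuracies are hypothetical and are not taken from the report.

```python
def retention(acc_high_school: float, acc_doctoral: float) -> float:
    """Share of high-school-stage accuracy that a model keeps at the doctoral stage.

    Assumed definition: doctoral accuracy divided by high-school accuracy.
    """
    if acc_high_school <= 0:
        raise ValueError("high-school accuracy must be positive")
    return acc_doctoral / acc_high_school


# Hypothetical per-stage accuracies, chosen only to illustrate the ratio.
example = retention(acc_high_school=0.50, acc_doctoral=0.40)
print(f"retention = {example:.0%}")
```

Under this reading, a retention near 79% would mean that a top system solves roughly four doctoral-stage items for every five high-school-stage items, relative to the size of each stage.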

Paper Structure

This paper contains 66 sections, 7 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Champion heatmap across benchmarks (transposed). Rows are benchmarks and columns are models. Each cell shows the accuracy; stars mark the per-benchmark champion (ties allowed).
  • Figure 1: All models: per-model radar grid (normalized). Row-wise min–max profiles reveal “roundness” (balanced) vs. spikes (specialization). Most models spike on MATH-500/AIME and show dents on HLE.
  • Figure 2: SKYLENAGE-ReasoningMATH construction pipeline. Our construction pipeline begins with a three-source intake—human authoring, rule-based generation, and structure-preserving rewrites—followed by multi-pass anti-contamination checks at the string, semantic, and template levels. We then perform style and format normalization, carry out bilingualization to ensure parity across languages, and add minimal process-hook annotations to enable step checks. Quality control is conducted with solver and simulator validation, after which we run a small pilot for difficulty calibration. Finally, we freeze the set for release.
  • Figure 2: Calibration to HLE (per-model scatter). Each panel regresses a target benchmark $y$ on HLE $x$. Dotted line: $y = x$. Solid line: OLS fit $y = ax + b$. Pearson $r$ measures agreement in ordering.
  • Figure 3: Reasoning-100 overview. Left: overall accuracy (sorted, %). Right: accuracy on the hardest quintile (Q5). GPT-5-20250807 reaches 81%, Qwen3-235B-A22B-2507 follows closely at 79%, and Grok-4-0709 at 75%. Against the tail, the leader's relative margin (the gap as a percentage of the lower score) is +44.6% over GLM-4.5 (56%), +80.0% over Llama 4 Maverick (45%), and +92.9% over Ernie-4.5-424B-A47B (42%). Top-5 overall (descending): GPT-5-20250807 (81), Qwen3-235B-A22B-2507 (79), Grok-4-0709 (75), GPT-oss-120b (69), Gemini2.5-Pro-0617 (69). On the hardest quintile, GPT-5-Chat-0807 leads at 35%; GPT-5-20250807 and Qwen3-235B-A22B-2507 follow at 30%.
  • ...and 8 more figures
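
The margins quoted in the Reasoning-100 overview caption appear to be relative gaps rather than percentage-point differences: the leader's lead divided by the trailing model's accuracy. A short sketch of that arithmetic, using only the accuracies stated in the caption; the helper name `relative_margin` is an illustrative choice, not something from the report.

```python
def relative_margin(leader: float, other: float) -> float:
    """Leader's gap over another model, as a percentage of the other model's score."""
    return (leader - other) / other * 100.0


# Overall accuracies (%) as stated in the Reasoning-100 overview caption.
leader = 81.0  # GPT-5-20250807
tail = {
    "GLM-4.5": 56.0,
    "Llama 4 Maverick": 45.0,
    "Ernie-4.5-424B-A47B": 42.0,
}

margins = {name: round(relative_margin(leader, acc), 1) for name, acc in tail.items()}
for name, margin in margins.items():
    print(f"{name}: +{margin:.1f}%")
# prints +44.6%, +80.0%, and +92.9%, matching the caption
```

Read as percentage points instead, the same gaps would be 25, 36, and 39 points, so the relative figures make the separation look larger than the raw accuracy difference.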