Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting
Goun Pyeon, Inbum Heo, Jeesu Jung, Taewook Hwang, Hyuk Namgoong, Hyein Seo, Yerim Han, Eunbin Kim, Hyeonseok Kang, Sangkeun Jung
TL;DR
<3-5 sentence high-level summary>This work introduces a contamination-free evaluation framework for measuring mathematical reasoning in Large Language Models (LLMs) using the 2026 Korean CSAT Mathematics exam. It digitizes all problems within two hours of release and evaluates 24 models across text, image, and text+image modalities with Korean and English prompts, while varying a reasoning-configuration space for GPT-5. Key findings show that perfect CSAT-level performance is mostly limited to the GPT-5 family under specific conditions, and that there are substantial time and cost trade-offs when increasing internal reasoning. The study contributes a fully leakage-free benchmark, a standardized data-digitization pipeline, and a practical, efficiency-conscious perspective on LLM evaluation that is applicable to high-stakes, real-world testing settings.
Abstract
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (Text-only, Image-only, Text+Figure) and prompt languages (Korean, English). The GPT-5 family models achieved perfect scores (100 points) under a limited set of language-modality configurations, while Grok 4, Qwen 3 235B, and Gemini 2.5 pro also scored above 97 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed Calculus as the weakest domain with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with GPT-5 series, increased reasoning intensity improved performance (82.6->100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a standardized digitization pipeline that converts human-targeted exam materials into LLM-ready evaluation data, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard; https://isoft.cnu.ac.kr/csat2026/
