Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Chongyang Gao; Diji Yang; Shuyan Zhou; Xichen Yan; Luchuan Song; Shuo Li; Kezhen Chen

TL;DR

CFE-Bench (Classroom Final Exam) is a STEM reasoning benchmark of text-only and multimodal problems drawn from authentic course materials. It uses a variable-based evaluation to avoid false positives in long-form answers. The dataset comprises $449$ problems ($305$ text-only and $144$ multimodal) across more than $20$ domains, and it provides expert-verified solution flows to enable fine-grained reasoning diagnostics. A core finding is that frontier models often solve individual reasoning steps correctly but struggle to derive and maintain correct intermediate states through long multi-step derivations; their reasoning flows also tend to be longer and less efficient than human solutions. The authors propose a diagnostic framework that decomposes problems into reasoning units, show substantial headroom for improvement in intermediate-state supervision, and advocate hybrid approaches and improved training objectives to foster more accurate and efficient STEM reasoning.
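To make the idea of variable-based evaluation concrete, here is a minimal sketch in Python. It assumes each answer is annotated with the four fields named in the Figure 1 caption (variable name, type, semantic description, ground-truth value) and that a model's predicted values have already been extracted from its long-form answer into a dictionary. The class and function names, the `"numeric"`/`"text"` type labels, and the 1% relative tolerance are all illustrative assumptions; this summary does not describe the benchmark's actual extraction or matching procedure.

```python
from dataclasses import dataclass
from math import isclose
from typing import Mapping, Sequence, Union

Value = Union[float, int, str]


@dataclass(frozen=True)
class AnswerVariable:
    """One annotated answer variable (field set follows the Figure 1 caption)."""
    name: str            # variable name
    var_type: str        # hypothetical labels: "numeric" or "text"
    description: str     # semantic description of the variable
    ground_truth: Value  # ground-truth value


def variable_matches(var: AnswerVariable, predicted, rel_tol: float = 1e-2) -> bool:
    """Compare one predicted value against its ground truth.

    The tolerance and the text normalization are assumptions for illustration.
    """
    if predicted is None:
        return False
    if var.var_type == "numeric":
        try:
            return isclose(float(predicted), float(var.ground_truth),
                           rel_tol=rel_tol, abs_tol=1e-9)
        except (TypeError, ValueError):
            # Prediction could not be parsed as a number.
            return False
    return str(predicted).strip().lower() == str(var.ground_truth).strip().lower()


def score_problem(variables: Sequence[AnswerVariable],
                  predictions: Mapping[str, Value]) -> bool:
    """A problem counts as correct only if every annotated variable matches."""
    return bool(variables) and all(
        variable_matches(v, predictions.get(v.name)) for v in variables
    )
```

Scoring each named variable separately, rather than searching the free-form answer for a matching string, is one way such a scheme could reduce false positives: a response that happens to mention the right number in the wrong role would not be credited.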

Abstract

We introduce CFE-Bench (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE-Bench is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE-Bench presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.

Paper Structure (30 sections, 1 equation, 6 figures, 4 tables)

Introduction
Related Work
Disparities in Reasoning Capabilities
Reasoning Benchmarks
CFE Benchmark
Collection
Expert Annotation Protocol
Variable-Based Evaluation
Model Performance on CFE-Bench
Deconstructing the Frontier Model Performance Gap
Formalizing the Reasoning Flow
Q1: Unit Execution Ability
Setup.
Findings.
Q2: Reasoning Progression Capability
...and 15 more sections

Figures (6)

Figure 1: Representative examples from CFE-Bench and variable-based annotation. Top: example text-only and multimodal problems from CFE-Bench. Bottom: the structured annotation for the answer of the text-only example, including the variable name, type, semantic description, and ground-truth value.
Figure 2: Subject distribution of CFE-Bench by modality. We report the field breakdown for the text-only subset ($305$; left) and the multimodal subset ($144$; right).
Figure 3: Unit Execution accuracy for text and multimodal subsets.
Figure 4: Sample-level diagnostics for the text subset. The red curve shows unit execution accuracy. The other curves show final-answer accuracy under unit conditioning: Reasoning Prefix, Reasoning Prefix (Questions Only), Single-Unit Injection, and Single-Unit Injection (Question Only). Note that although all curves share the same $y$-axis scale, the red curve measures unit-level correctness, whereas the remaining curves measure final-answer correctness.
Figure 5: Sample-level diagnostics for the multimodal subset.
...and 1 more figure
