Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

Xingyu Dang; Rohit Agarwal; Rodrigo Porto; Anirudh Goyal; Liam H Fowl; Sanjeev Arora

Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

Xingyu Dang, Rohit Agarwal, Rodrigo Porto, Anirudh Goyal, Liam H Fowl, Sanjeev Arora

TL;DR

The paper presents a cost-efficient inference pipeline for IMO-style math using only off-the-shelf models. By identifying and mitigating Cognitive Plateau and Cognitive Well through conjecture extraction and context detachment, the method achieves state-of-the-art performance on IMO ProofBench Advanced at substantially lower cost than prior pipelines. It emphasizes a narrow parallelism strategy, explicit lemmas, and a global memory of verified conjectures to drive progress. The work demonstrates strong empirical results, analyzes grader behavior, and provides open prompts to reduce barriers for researchers and enthusiasts, while acknowledging limitations in grading reliability and latency.

Abstract

In the past year, custom and unreleased math reasoning models reached gold medal performance on the International Mathematical Olympiad (IMO). Similar performance was then reported using large-scale inference on publicly available models but at prohibitive costs (e.g., 3000 USD per problem). In this work, we present an inference pipeline that attains best-in-class performance on IMO-style math problems at an average inference cost orders of magnitude below competing methods while using only general-purpose off-the-shelf models. Our method relies on insights about grader failure in solver-grader pipelines, which we call the Cognitive Well (iterative refinement converging to a wrong solution that the solver as well as the pipeline's internal grader consider to be basically correct). Our pipeline addresses these failure modes through conjecture extraction, wherein candidate lemmas are isolated from generated solutions and independently verified alongside their negations in a fresh environment (context detachment). On IMO-ProofBench Advanced (PB-Adv), our pipeline achieves 67.1 percent performance using Gemini 3.0 Pro with an average cost per question of approximately 31 USD. At the time of evaluation, this represented the state-of-the-art on PB-Adv among both public and unreleased models, and more than doubles the success rate of the next best publicly accessible pipeline, all at a fraction of the cost.

Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

TL;DR

Abstract

Paper Structure (38 sections, 5 equations, 7 figures, 6 tables, 4 algorithms)

This paper contains 38 sections, 5 equations, 7 figures, 6 tables, 4 algorithms.

Introduction
Identifying Failure Modes
Misconception: "Grader as compass"
Contributions and Paper Organization
Method Overview
Pipeline Design Ideas
Pipeline Architecture
Results
Cost Comparison
Solver Prompt Evaluation
Grader Evaluation
Conjecture Removal: Analysis
Limitations
Conclusion
Algorithm Details
...and 23 more sections

Figures (7)

Figure 1: Results for our proposed method versus competing methods and models. Both expert graded and autograded results included (where possible). Our proposed method outperforms all open-source pipelines, while being significantly cheaper than these pipelines (cf. Figure \ref{['fig:pareto']}) and ranks second only to a custom and unreleased pipeline. Comparison results taken from luong2025towards where applicable, and from shao2025deepseekmath. Autograder results calculated using the resources published in luong2025towards.
Figure 2: A high level overview of our pipeline. A problem is processed by parallel solver--grader pairs (module A). When progress stalls, the Conjecture Extractor identifies candidate lemmas $C_i$ and their negations from current proof attempts. Each conjecture is resolved by an independent solver--grader module (also A), with proofs (for either the conjecture or its negation) fed back to enable further progress toward the final solution.
Figure 3: Pareto frontier of cost vs. performance measured using the autograder in luong2025towards. "Single" denotes that the output of a single run of the pipeline. "Combined" denotes our standard algorithm for PB-Adv which uses a further model call to choose from two parallel pipeline runs - cf. Section \ref{['subsec:pipeline_arch']} "Post Enhancement and Aggregation"). All costs calculated with January 2026 pricing.
Figure 4: Results for IMO Proof Bench (Basic) according to the IMO-Bench autograder. "Single" here denotes that the output of a single run of the pipeline was submitted (as opposed to our standard algorithm for PB-Adv which uses a further model call to choose from two parallel pipeline runs - cf. Section \ref{['subsec:pipeline_arch']} "Post Enhancement and Aggregation"). All comparison results taken from luong2025towards.
Figure 5: Study of benefits of conjecture extraction/resolution on the solution pipeline. We run (a) the pipeline without any conjecture steps and (b) with conjecture steps. For both runs, we equalize total budgets. We sample solutions for five phases throughout the runs. We then grade these runs using both the pipeline's internal grader and the external ProofAutoGrader. We see that with conjectures, the ProofAutoGrader scores monotonically increase, while those without conjectures may jump wildly or flatline. Upward spikes in conjecture-based solving correspond to proof/disproof of conjectures.
...and 2 more figures

Theorems & Definitions (6)

proof
proof
proof
proof
proof
proof

Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

TL;DR

Abstract

Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

Authors

TL;DR

Abstract

Table of Contents

Figures (7)

Theorems & Definitions (6)