Table of Contents
Fetching ...

Diverse Inference and Verification for Advanced Reasoning

Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell

TL;DR

This work proposes a diverse inference framework that ensembles multiple models and test-time strategies to tackle challenging reasoning tasks (IMO combinatorics, ARC puzzles, HLE). It couples test-time simulations, adaptive meta-learning, and perfect verifiers (Lean formalization for IMO, code execution for ARC) with imperfect verifiers (best-of-N for HLE) to boost answer accuracy. Empirical results show substantial gains: IMO combinatorics accuracy rises from 33.3% to 77.8%, ARC puzzles solved beyond hundreds of humans and beyond some high-compute models, and HLE performance improves with best-of-N sampling at the cost of compute. The contributions include a formal verification pipeline, autoformalization to Lean, interactive game representations, and public release plans, underscoring a robust, scalable, and reproducible approach to AI-assisted mathematical reasoning and verification.

Abstract

Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet find challenging advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and rejection sampling on other problems is simple and effective. We automatically verify correctness of solutions to IMO problems by Lean, and ARC puzzles by code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.

Diverse Inference and Verification for Advanced Reasoning

TL;DR

This work proposes a diverse inference framework that ensembles multiple models and test-time strategies to tackle challenging reasoning tasks (IMO combinatorics, ARC puzzles, HLE). It couples test-time simulations, adaptive meta-learning, and perfect verifiers (Lean formalization for IMO, code execution for ARC) with imperfect verifiers (best-of-N for HLE) to boost answer accuracy. Empirical results show substantial gains: IMO combinatorics accuracy rises from 33.3% to 77.8%, ARC puzzles solved beyond hundreds of humans and beyond some high-compute models, and HLE performance improves with best-of-N sampling at the cost of compute. The contributions include a formal verification pipeline, autoformalization to Lean, interactive game representations, and public release plans, underscoring a robust, scalable, and reproducible approach to AI-assisted mathematical reasoning and verification.

Abstract

Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet find challenging advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and rejection sampling on other problems is simple and effective. We automatically verify correctness of solutions to IMO problems by Lean, and ARC puzzles by code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.

Paper Structure

This paper contains 75 sections, 101 equations, 54 figures, 34 tables.

Figures (54)

  • Figure 1: IMO agent architecture.
  • Figure 2: Ablation over problems, methods, and models. Correct answers (in green) for each Mathematical Olympiad problem (column), method (row), and model (top panel o3-mini high, bottom panel o1). Problems are from the 2024 International Mathematical Olympiad (IMO), 2024 USA Mathematical Olympiad (USAMO), and 2023 IMO ShortList (IMOSL). All problems are non-contaminated by the underlying models since their knowledge cutoff dates is after the release of the solutions. The bottom row shows which problems are answered correctly by any of the different methods and their answer automatically verified. Numbers inside cells indicate running times in seconds. AG denotes the IMO agent whose details are in Appendices F-N. Additional results and evaluations are in Appendices C-E.
  • Figure 3: 2024 IMO contestant rank vs. total score. Our approach proves the fifth problem in combinatorics correctly with a score of 7/7 whereas the average human IMO participant score is 2.25/7 on this problem. This result tips performance to solving 5/6 problems correctly, with a rank of 5 and a score of 35/42.
  • Figure 4: ARC performance for different models and methods and human performance on evaluation dataset of 400 puzzles.
  • Figure 5: Zooming in on diversity performance of 16 models and methods on 400 ARC evalutaion puzzles.
  • ...and 49 more figures