Tool Building as a Path to "Superintelligence"

David Koplow, Tomer Galanti, Tomaso Poggio

TL;DR

We design a benchmark to measure the step-success probability $\gamma$ on logical out-of-distribution inference and find that successful reasoning at scale is contingent upon precise tool calls. This identifies tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.

Abstract

The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $\gamma$. In this work, we design a benchmark to measure $\gamma$ on logical out-of-distribution inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each reasoning step, and that are, from an information-theoretic standpoint, impossible to reliably solve unless the LLM carefully integrates all of the information provided. Our analysis demonstrates that while the $\gamma$ value for small LLMs declines superlinearly as depth increases, frontier models exhibit partial robustness on this task. Furthermore, we find that successful reasoning at scale is contingent upon precise tool calls, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.

Paper Structure (18 sections, 5 theorems, 22 equations, 9 figures, 1 table)

Key Result

Lemma 4.0

Assume the instance distribution samples supports $S_1,\dots,S_n$ i.i.d. uniformly (with replacement) from $\{S\subseteq[p]:|S|=d-1\}$. Then for any $g<n$, conditioned on the revealed prefix $P_g$, the next support $S_{g+1}$ is uniform over $\{S\subseteq[p]:|S|=d-1\}$ and independent of $P_g$. Conse
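The first part of the lemma can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's code: it samples supports i.i.d. uniformly from the $(d-1)$-subsets of $[p]$ and measures how often a predictor that sees only the revealed prefix guesses the next support. The function names and the particular guessing rule (repeat the most frequent support in the prefix) are hypothetical; because the next support is independent of the prefix, any prefix-only rule should land near the chance level $1/\binom{p}{d-1}$.

```python
from collections import Counter
from itertools import combinations
from math import comb
import random


def all_supports(p, d):
    """Every subset of {0, ..., p-1} of size d-1, as sorted tuples."""
    return list(combinations(range(p), d - 1))


def history_only_success(p, d, g, trials, seed=0):
    """Empirical success rate of a predictor that only sees the prefix.

    Hypothetical guessing rule: predict the most frequent support among
    the g revealed ones. The rule is arbitrary; the point is that it
    depends on the prefix alone.
    """
    rng = random.Random(seed)
    supports = all_supports(p, d)
    hits = 0
    for _ in range(trials):
        # Revealed prefix: g supports drawn i.i.d. with replacement.
        prefix = [rng.choice(supports) for _ in range(g)]
        guess = Counter(prefix).most_common(1)[0][0]
        # Next support: a fresh uniform draw, independent of the prefix.
        next_support = rng.choice(supports)
        hits += guess == next_support
    return hits / trials


if __name__ == "__main__":
    p, d = 12, 4
    print("chance level:", 1 / comb(p, d - 1))
    print("empirical   :", history_only_success(p, d, g=5, trials=20000))
```

With $p=12$ and $d=4$ there are $\binom{12}{3}=220$ candidate supports, so the chance level is $1/220 \approx 0.0045$.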

Figures (9)

  • Figure 1: Diligent Learner visualization from Shalev-Shwartz and Shashua (2025).
  • Figure 2: Diligent Learner as validator-guided DFS. Good extensions occur with probability at least $\gamma$. On failure, the policy backtracks to the deepest correct prefix $\beta(h)$ and continues search.
  • Figure 3: Only history and data together sustain reliable next-step prediction. Step success $\gamma_g$ (probability mass on the correct next monomial) versus depth $g$ for each estimator class. Curves show the mean over $2000$ generated circuits per depth, with shaded Jeffreys intervals. The diligent estimator $\mathcal{A}$ (history+data) maintains high $\gamma_g$ across depths, whereas $\mathcal{B}$ (data-only) and $\mathcal{D}$ (partial) frequently collapse toward zero mass, and $\mathcal{C}$ (history-only) remains at chance.
  • Figure 4: As both $g$ and $p$ increase, the step success $\gamma_g$ of estimators with imperfect information collapses toward zero. Only Estimator $\mathcal{A}$ consistently produces the next monomial. The heatmap was constructed by generating $200$ circuits for each combination of hyperparameters and computing the corresponding $\gamma_g$ for each $p$.
  • Figure 5: Small LLMs exhibit depth-induced collapse in next-step prediction. Step-success $\gamma_g$ (probability mass on the correct next monomial) versus circuit depth $g$ for Qwen3-2507 models under adversarial sampling ($p=12$, $d=4$). Despite the existence of a polynomial-time decoder at every step (Thm. \ref{thm:recoverability_poly}), all models degrade with depth: larger and "thinking" variants help at small $g$, but performance drops sharply at intermediate depths and approaches the trivial baseline $\gamma_{\mathrm{triv}}$, indicating limited ability to maintain the prefix-conditioned cancellation required for continued progress.
  • ...and 4 more figures
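
The caption of Figure 2 describes search in which a proposed extension is good with probability at least $\gamma$ and a failure sends the policy back to the deepest correct prefix. A deliberately simplified simulation of that loop is sketched below. It assumes a perfect validator that checks every proposal immediately, which is an assumption of this sketch and not necessarily the procedure used in the framework; the function names are hypothetical. Under that assumption the expected number of proposals needed to chain all steps is $\sum_g 1/\gamma_g$, which makes concrete why a step success that collapses with depth is costly.

```python
import random


def proposals_to_finish(gammas, rng):
    """Count proposals until every step has been extended correctly.

    gammas[g] is the probability that a proposal at depth g is correct.
    Assumption of this sketch: a perfect validator rejects a bad
    proposal at once, so the search stays at the deepest correct prefix
    and simply tries again.
    """
    proposals = 0
    depth = 0
    while depth < len(gammas):
        proposals += 1
        if rng.random() < gammas[depth]:
            depth += 1  # accepted: the correct prefix grows by one step
        # rejected: remain at the current (deepest correct) prefix
    return proposals


def mean_proposals(gammas, trials=5000, seed=0):
    """Monte Carlo estimate of the expected number of proposals."""
    rng = random.Random(seed)
    total = sum(proposals_to_finish(gammas, rng) for _ in range(trials))
    return total / trials


def expected_proposals(gammas):
    """Closed form under the same assumption: sum of 1 / gamma_g."""
    return sum(1.0 / g for g in gammas)


if __name__ == "__main__":
    flat = [0.5] * 10             # constant step success
    decaying = [1.0, 0.5, 0.25]   # step success halves with depth
    print(expected_proposals(flat), mean_proposals(flat))
    print(expected_proposals(decaying), mean_proposals(decaying))
```

For a constant $\gamma = 0.5$ over ten steps the closed form gives 20 proposals, while the three-step decaying schedule gives $1 + 2 + 4 = 7$, already dominated by its deepest step.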

Theorems & Definitions (7)

  • Lemma 4.0: History-only is prior guessing
  • Lemma 4.0: Monomial firing probability at fixed Hamming weight
  • Lemma 4.0: Bayes masking given observed $(a,v)$
  • Theorem B.1: Poly-time recovery under fixed-weight payloads
  • proof
  • Corollary B.1: High-probability recovery with $K$ samples
  • proof
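
The statement of the lemma on monomial firing probability at fixed Hamming weight is not reproduced in this listing, so the following is only the standard counting identity that its title suggests, offered as a hedged illustration. If $x \in \{0,1\}^p$ is uniform among vectors of Hamming weight $w$, a monomial over $k$ fixed variables equals 1 exactly when all $k$ variables are set, which happens with probability $\binom{p-k}{w-k}/\binom{p}{w}$. The function names are hypothetical, and taking $k = d-1$ to match the support size used above is an assumption.

```python
from itertools import combinations
from math import comb


def firing_probability(p, w, k):
    """P(all k chosen variables are 1) for uniform x with Hamming weight w."""
    if k > w:
        return 0.0  # not enough ones to cover the monomial's support
    return comb(p - k, w - k) / comb(p, w)


def firing_probability_bruteforce(p, w, support):
    """Same quantity by enumerating every weight-w input vector."""
    support = set(support)
    fired = total = 0
    for ones in combinations(range(p), w):
        total += 1
        fired += support <= set(ones)  # monomial fires iff support is covered
    return fired / total


if __name__ == "__main__":
    p, w, k = 12, 6, 3
    print(firing_probability(p, w, k))                       # closed form
    print(firing_probability_bruteforce(p, w, range(k)))     # enumeration
```

For $p=12$, $w=6$, $k=3$ both routes give $84/924 = 1/11$.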