Can Complexity and Uncomputability Explain Intelligence? SuperARC: A Test for Artificial Super Intelligence Based on Recursive Compression

Alberto Hernández-Espinosa, Luan Ozelim, Felipe S. Abrahão, Hector Zenil

TL;DR

This work proposes SuperARC, a human-agnostic benchmarking framework grounded in Algorithmic Information Theory to assess AI across AGI/ASI frontiers by focusing on abstraction (compression) and prediction (inference). Leveraging the Coding Theorem Method (CTM) and the Block Decomposition Method (BDM), the framework combines neurosymbolic methods with pattern-based approaches to quantify an AI model's ability to compress sequences and generate executable models for them, revealing that frontier LLMs often rely on memorisation or pattern matching and can regress across generations. The results across next-digit tasks, free-form generation, and code-generation experiments show that algorithmic reasoning remains limited in current models, while neurosymbolic baselines can achieve high compression-based understanding, highlighting the need for integrating symbolic reasoning in AI development. The authors discuss open-ended evaluation, Algorithmic Information Dynamics (AID), and policy implications, advocating a shift toward algorithmic benchmarks to complement traditional human-centric assessments, and outlining practical steps for adoption, governance, and future research directions.

Abstract

We introduce an increasing-complexity, open-ended, and human-agnostic metric to evaluate foundational and frontier AI models in the context of Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI) claims. Unlike other tests that rely on human-centric questions and expected answers, or on pattern-matching methods, the test introduced here is grounded in the fundamental mathematical areas of randomness and optimal inference. We argue that human-agnostic metrics based on the universal principles established by Algorithmic Information Theory (AIT), which formally frame the concepts of model abstraction and prediction, offer a powerful metrological framework. When applied to frontier models, the leading LLMs outperform most others in multiple tasks, but they do not always do so with their latest model versions, which often regress and appear far from any global maximum or target estimated using the principles of AIT that define a Universal Intelligence (UAI) point and trend in the benchmarking. Conversely, a hybrid neuro-symbolic approach to UAI based on the same principles is shown to outperform frontier specialised prediction models in a simplified but relevant example related to compression-based model abstraction and sequence prediction. Finally, we prove and conclude that predictive power through arbitrary formal theories is directly proportional to compression over the algorithmic space, not the statistical space, and so further progress in AI models can only be achieved in combination with symbolic approaches, which LLM developers are adopting, often without acknowledgement or realisation.

Paper Structure

This paper contains 66 sections, 1 theorem, 40 equations, 15 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $x = x_1 x_2 \ldots x_n \ldots$ be an infinite sequence (or, equivalently, a real number). Then $x$ is algorithmically random iff there is no (left) semicomputable martingale that succeeds on $x$.
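
For readers unfamiliar with the terminology, the notions used in Theorem 1 can be written out as follows. This is the standard textbook formulation for binary sequences, given here as an assumption; the paper's own definitions and notation (the symbol $d$ in particular is ours) may differ in detail. A martingale is a betting strategy $d$ that assigns a non-negative capital to every finite binary string and satisfies a fairness condition, and it succeeds on $x$ when its capital is unbounded along the prefixes of $x$:

$$
d : \{0,1\}^{*} \to [0, \infty), \qquad
d(w) = \frac{d(w0) + d(w1)}{2} \quad \text{for all } w \in \{0,1\}^{*},
$$

$$
d \text{ succeeds on } x \iff \limsup_{n \to \infty} d(x_1 x_2 \ldots x_n) = \infty .
$$

Under this reading, "(left) semicomputable" means that $d(w)$ can be approximated from below by a computable, non-decreasing sequence of rationals, uniformly in $w$. Theorem 1 then says that no such effectively approximable betting strategy can win unbounded capital on an algorithmically random sequence.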

Figures (15)

  • Figure 1: Prediction accuracy (%) on binary "climbers" and random binary sequences for LLMs specialising in time-series prediction, compared with BDM. That climbers (up) were better predicted is expected of models that are able to intrinsically characterise, and therefore better predict, simpler sequences. Sequence prediction is a fundamental problem in science, from genetics to protein folding in biology to digital twin technology in medicine and healthcare.
  • Figure 2: Quantitative agreement on the monotonic increase of sequence complexity: comparison of BDM, Shannon entropy, and average Zip and LZW compressed lengths over the time series generated to test LLMs. The sequences chosen for each complexity class follow a pattern of increasing complexity in all cases, according to both statistical and algorithmic measures, and are used to build the testing sets, divided into three complexity groups, against which LLMs are assessed.
  • Figure 3: Comprehensive analysis of formulae generation for numerical sequences of increasing complexity. Top left: Percentage of equivalence between generated formulae, measuring output similarity and solution diversity. Top right: Accuracy rates showing correct replication of target numeric sequences across complexity levels. Bottom: Integrated view combining formula generation volume (gold line, secondary axis) with type distribution among both total (lighter bars) and accurate (darker bars) responses, categorised as known sequences (blue), pure mathematical expressions (green), and not found (red). The results demonstrate a direct correlation between sequence complexity and diminished model performance, with particularly stark degradation in equivalence rates suggesting limited solution diversity. The integrated bottom panel reveals that whilst models may generate valid formulae at lower complexities, the proportion of accurate responses declines precipitously, and reliance on known sequences dominates over novel mathematical reasoning. These limitations are especially pronounced in contexts permitting complete freedom to discover diverse yet correct solutions, underscoring an absence of genuine creativity and mathematical understanding, attributes often mistakenly attributed to these models \cite{zhao2024assessingCreativity}. Notably, newer versions of ChatGPT-o1, Grok, and Gemini performed worse than their preview iterations (see Supplementary Information).
  • Figure 4: Comprehensive analysis of language model performance in Python script generation across complexity levels (Low, Medium, High). Top: Equivalence percentage (left) and accuracy (right) versus complexity. Bottom: For each model, semi-transparent left bars show total script type distribution (Known sequence=red, Not found=blue, Pure math=green, Print=orange); solid right bars show accurate predictions only; gold diamonds (right y-axis) indicate valid script volume. The disparity between left and right bar heights quantifies the accuracy gap. The results expose fundamental LLM limitations: whilst models generate coherent solutions, accuracy deteriorates markedly with complexity. The predominance of 'Not found' (blue) at higher complexities indicates systematic failure to recognise solution strategies. The upper trajectories show that equivalence remains stable whilst accuracy plummets: models generate internally consistent but incorrect approaches. Without analogous training exemplars, LLMs cannot reliably deduce solutions despite extensive Python training. Notably, newer iterations (ChatGPT-5, Grok, Gemini) underperformed preview versions (see Supplementary Information), challenging assumptions of monotonic improvement.
  • Figure 5: Benchmarking plot from Table \ref{tableRanking} showing that most frontier models perform close to each other under this test and remain far from AGI or ASI goals. ASI would be able to distinguish simpler from complex sequences and generate predictive models for each accordingly, as AIXI \cite{hutter2005universal} or CTM/BDM \cite{nmibdm} would do as instantiations of Universal AI (UAI), which we take as an example of ASI understood as optimal abstraction and prediction. Today, LLMs only produce or retrieve models for sequences that were seen and found in their original training sets: increasing the sequences' lengths degrades LLM performance in identifying the sequence, indicating that sequences are recognised not from first principles but through simplistic pattern matching.
  • ...and 10 more figures
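
The comparison of statistical complexity measures described in the caption of Figure 2 can be illustrated with a small sketch. It is not the authors' pipeline: BDM is omitted, `zlib` (DEFLATE) stands in for the Zip and LZW compressors, and the three test sequences and the helper names `shannon_entropy` and `compressed_length` are assumptions made for this example only.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(bits: str) -> float:
    """Shannon entropy, in bits per symbol, of the symbol frequencies in `bits`."""
    counts = Counter(bits)
    n = len(bits)
    # log2(n / c) == -log2(c / n); written this way to avoid returning -0.0.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def compressed_length(bits: str) -> int:
    """Length in bytes of the zlib (DEFLATE) encoding of `bits`."""
    return len(zlib.compress(bits.encode("ascii"), 9))


# Hypothetical test sequences of (intuitively) increasing complexity.
rng = random.Random(0)  # fixed seed so the example is reproducible
n = 4096
sequences = {
    "constant": "0" * n,
    "periodic": "01" * (n // 2),
    "pseudo-random": "".join(rng.choice("01") for _ in range(n)),
}

for name, seq in sequences.items():
    print(f"{name:>14}: entropy={shannon_entropy(seq):.3f} "
          f"zlib_bytes={compressed_length(seq)}")
```

Per-symbol entropy only sees symbol frequencies, so it assigns the periodic sequence exactly 1 bit per symbol, the same ceiling a random binary sequence approaches, while the compressed length separates the two. This is one way to see why the choice of measure matters when sequences are grouped into complexity classes.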

Theorems & Definitions (3)

  • Theorem 1: incompressibility and unpredictability
  • Proof (Compression implies Prediction)
  • Proof (Prediction implies Compression)
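
The direction "compression implies prediction" named above can be illustrated with a toy example. The sketch below is not the paper's proof and does not use CTM or BDM: it assumes a deliberately restricted model class (purely periodic generators), picks the shortest model consistent with the observed data, and uses that model to predict the next symbol. The function names `shortest_period` and `predict_next` are hypothetical.

```python
def shortest_period(seq: str) -> int:
    """Smallest p such that seq[i] == seq[i - p] for every i >= p.

    The pair (p, seq[:p]) is a short description that regenerates `seq`,
    so a small p relative to len(seq) means the sequence compresses well
    within this restricted model class.
    """
    if not seq:
        raise ValueError("seq must be non-empty")
    n = len(seq)
    for p in range(1, n + 1):
        if all(seq[i] == seq[i - p] for i in range(p, n)):
            return p
    return n  # not reached: p == n always passes the check above


def predict_next(seq: str) -> str:
    """Predict the next symbol by running the shortest periodic model one step."""
    p = shortest_period(seq)
    return seq[len(seq) - p]


observed = "0110110"
print(shortest_period(observed), predict_next(observed))
```

When the observed data admit a description much shorter than themselves, the same description yields a prediction; when the shortest period equals the full length, the model has compressed nothing and its prediction carries no support from the data.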