Solomonoff-Inspired Hypothesis Ranking with LLMs for Prediction Under Uncertainty

Josh Barber, Rourke Young, Cameron Coombe, Will Browne

TL;DR

The paper tackles uncertainty in abstract reasoning by ranking and combining multiple LLM-generated hypotheses with a Solomonoff-inspired scoring scheme that jointly favours simplicity and data fit. It builds a finite, computable hypothesis pool, creates a per-cell weighted prediction matrix, and compares against Bayesian Model Averaging on Mini-ARC tasks. Key findings show better uncertainty calibration for the Solomonoff approach in the presence of noisy hypotheses, while BMA can yield sharper predictions when hypotheses are reliable. The work demonstrates the value of algorithmic information-theoretic priors for interpretable, robust multi-hypothesis reasoning under data sparsity, with potential extensions to robotics and larger ARC benchmarks.

Abstract

Reasoning under uncertainty is a key challenge in AI, especially for real-world tasks, where problems with sparse data demand systematic generalisation. Existing approaches struggle to balance accuracy and simplicity when evaluating multiple candidate solutions. We propose a Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit. Applied to Mini-ARC benchmark tasks, our method produces Solomonoff-weighted mixtures for per-cell predictions, yielding conservative, uncertainty-aware outputs even when hypotheses are noisy or partially incorrect. Compared to Bayesian Model Averaging (BMA), Solomonoff scoring spreads probability more evenly across competing hypotheses, while BMA concentrates weight on the most likely but potentially flawed candidates. Across tasks, this highlights the value of algorithmic information-theoretic priors for interpretable, reliable multi-hypothesis reasoning under uncertainty.
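
As a rough illustration of weighting hypotheses by simplicity and predictive fit, the following Python sketch scores each hypothesis as 2^(-complexity) times a fit value and normalises the result. This summary does not give the paper's scoring formula, so the function name `solomonoff_style_weights`, the use of description length in bits as the complexity measure, and the fit values are all assumptions made for illustration.

```python
def solomonoff_style_weights(complexities_bits, fits):
    """Return normalised weights proportional to 2^(-complexity) * fit.

    complexities_bits: hypothetical description length of each hypothesis, in bits.
    fits: hypothetical fit score of each hypothesis on the training examples, in [0, 1].
    """
    if not complexities_bits or len(complexities_bits) != len(fits):
        raise ValueError("need one complexity and one fit value per hypothesis")

    # Simplicity prior (shorter description, larger prior) times data fit.
    raw = [2.0 ** (-k) * f for k, f in zip(complexities_bits, fits)]
    total = sum(raw)

    if total == 0:
        # No hypothesis fits the data at all: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


# Three hypotheses that fit equally well but differ in description length.
weights = solomonoff_style_weights([1, 2, 3], [1.0, 1.0, 1.0])
```

With equal fit, each extra bit of description length halves a hypothesis's weight, which is the simplicity bias a Solomonoff-style prior encodes.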

Paper Structure

This paper contains 26 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Mini-ARC task where odd columns shift down and even columns up; the agent infers this rule from input–output examples (see Table \ref{tab:task-a-hypotheses}). The nth output is withheld for illustration but included in the dataset.
  • Figure 2: Overall experimental pipeline: (1) Input task and object extraction, (2) Hypothesis generation via LLM, (3) Evaluation using accuracy and simplicity, (4) Solomonoff-inspired or Bayesian weighting, and (5) Aggregated per-cell predictions.
  • Figure 3: Task A – Alternating column shifts with six generated hypotheses. Comparison of Solomonoff-weighted vs Bayesian Model Averaging (BMA) predictions. Solomonoff achieved 64% accuracy vs 48% for BMA.
  • Figure 4: Task B – Centralisation task. Both Solomonoff and BMA predictions achieve 100% accuracy, though Solomonoff shows more conservative confidence calibration.
  • Figure 5: Task C – Alternating column shifts with 20 noisy hypotheses. Accuracy drops for both methods (Solomonoff 60%, BMA 64%), highlighting overconfidence versus cautious weighting under hypothesis noise. Compared to Task A, overall confidence and correctness are lower.
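
Step (5) of the pipeline in Figure 2, aggregated per-cell predictions, can be sketched as a weighted vote over the grids that each hypothesis predicts. The sketch below is an assumption about how such a mixture might be formed, not the authors' implementation; `per_cell_mixture`, `most_likely_grid`, and the toy grids and weights are hypothetical.

```python
from collections import defaultdict


def per_cell_mixture(predicted_grids, weights):
    """Combine per-hypothesis output grids into a per-cell distribution over colours.

    predicted_grids: list of equally sized 2D lists of integer colour codes.
    weights: one normalised weight per hypothesis (assumed to sum to 1).
    """
    rows, cols = len(predicted_grids[0]), len(predicted_grids[0][0])
    mixture = [[defaultdict(float) for _ in range(cols)] for _ in range(rows)]
    for grid, w in zip(predicted_grids, weights):
        for r in range(rows):
            for c in range(cols):
                # Each hypothesis votes for its predicted colour with its weight.
                mixture[r][c][grid[r][c]] += w
    return mixture


def most_likely_grid(mixture):
    """Return, for every cell, the (colour, probability) pair with the highest mass."""
    return [[max(cell.items(), key=lambda kv: kv[1]) for cell in row]
            for row in mixture]


# Toy 2x2 example with three hypothetical hypotheses.
grids = [
    [[1, 0], [0, 1]],
    [[1, 0], [1, 1]],
    [[1, 2], [1, 1]],
]
weights = [0.6, 0.3, 0.1]
best = most_likely_grid(per_cell_mixture(grids, weights))
```

Cells on which all hypotheses agree receive probability close to 1, while contested cells keep a spread of mass across colours, which is how a mixture of this kind exposes per-cell uncertainty.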