Tiny Moves: Game-based Hypothesis Refinement

Agnieszka Dobrowolska; Rogier Hintzen; Martin Balla; Karl Gemayel; Sabine Reichert; Thomas Charman; Jen Ning Lim; Lindsay Edwards; Anna Gogleva

TL;DR

The paper introduces The Hypothesis Game, a symbolic, move-based framework for hypothesis refinement that makes scientific reasoning explicit through a shared hypothesis state and a fixed grammar of moves. Implemented with a central LLM controller (Game Master), the minimal game is evaluated on Reactome-derived pathway tasks, showing superior error removal and precision in corruption recovery and competitive performance in reconstruction from partial cues. The work demonstrates that incremental, interpretable edits can improve transferability and controllability of AI-driven scientific discovery, while also outlining clear avenues for richer representations and learned controllers. Overall, game-based reasoning offers a principled route to more interpretable, reusable, and robust hypothesis refinement systems for scientific progress.

Abstract

Most machine learning approaches to scientific discovery frame hypotheses as end-to-end predictions, obscuring the incremental structure of scientific reasoning. We propose The Hypothesis Game, a symbolic formalism for hypothesis refinement in which LLM agents operate on a shared hypothesis state using a fixed grammar of reasoning moves. The framework is motivated by the observation that scientific progress often proceeds through small, localized revisions, grounded in domain context, rather than extensive rewrites. We instantiate a minimal game with LLM agents and evaluate it on pathway-level mechanistic refinement tasks. In the primary setting of corruption recovery, where hypotheses contain controlled errors, the game-based approach consistently removes more errors and achieves higher precision than strong prompting baselines, while preserving valid structure through incremental edits. In a secondary reconstruction setting from partial cues, it performs comparably to the strongest baseline, indicating that explicit move-based refinement remains competitive even when ground-truth recovery is difficult. These findings support game-based reasoning as a principled route to more controllable, interpretable, and transferable hypothesis refinement systems for scientific discovery.
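The idea of a shared hypothesis state edited through a fixed grammar of moves can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the move names (`add_step`, `remove_step`, `replace_step`), the `HypothesisState` class, and the representation of a hypothesis as a list of text steps are all hypothetical, chosen to show how small, logged edits keep a refinement traceable.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical move names; the paper's actual grammar is not listed here.
ALLOWED_MOVES = {"add_step", "remove_step", "replace_step"}


@dataclass
class HypothesisState:
    """A hypothesis as an ordered list of text steps plus an edit log."""
    steps: List[str]
    history: List[Tuple[str, int, str]] = field(default_factory=list)

    def apply(self, move: str, index: int, text: str = "") -> None:
        """Apply one localized edit and record it in the history."""
        if move not in ALLOWED_MOVES:
            raise ValueError(f"unknown move: {move}")
        if move == "add_step":
            self.steps.insert(index, text)
        elif move == "remove_step":
            del self.steps[index]
        else:  # replace_step
            self.steps[index] = text
        self.history.append((move, index, text))


# Toy pathway with one corrupted step (index 1), repaired by a single move.
state = HypothesisState(["A binds B", "B inhibits C", "C activates D"])
state.apply("replace_step", 1, "B activates C")
```

Because every change passes through `apply`, the `history` list gives a step-by-step record of how the hypothesis evolved, and moves outside the fixed grammar are rejected rather than silently applied.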

Paper Structure (49 sections, 7 equations, 10 figures, 7 tables, 2 algorithms)

Introduction
Framework
Hypothesis Representation
Reasoning Grammar (Moves)
Game Modes
Scoring
Game variants
Implementation
Experiment set-up
Task Setup
Common Experimental Principles
Results
Qualitative observations.
Corruption task.
Reconstruction task.
...and 34 more sections

Figures (10)

Figure 1: A conceptual framework for reasoning games. The objective of the game is to evolve a hypothesis fragment through a sequence of reasoning moves, with progress assessed through properties such as novelty, coherence, and traceability. *Graph structures shown for conceptual illustration only; actual implementation uses structured text fragments with equivalent reasoning operations.
Figure 2: Conceptual illustration of the two evaluation tasks. Left: corruption recovery, where controlled errors are introduced into a valid pathway and the system must detect and repair them while preserving correct structure. Right: reconstruction from partial cues, where a system recovers pathway steps starting from sparse contextual input and external biomedical evidence.
Figure 3: Representative example run of Hypothesis Game and ReAct on the corruption task, illustrating incremental vs large single-step edits. *Other changes are quantified as (1) the number of biological entity additions/removals and (2) word-level normalised Levenshtein distance to the reference pathway. See Fig. \ref{fig:hypothesis_drift} for details.
Figure 4: Comparison of Hypothesis Game vs. prompting baselines on two pathway-level tasks. Bars show averages over the evaluation sets described in the text. The error bars show 95% confidence intervals. Top row: Corruption; Hypothesis Game balances error removal and retention of valid content, achieving the highest precision, F1 and error removal rate (for all scores Friedman test $p<0.0001$, post-hoc Wilcoxon test with Bonferroni correction $p<0.0005$). Bottom row: Reconstruction; All methods struggled with faithfully reconstructing the pathways. ReAct and Hypothesis Game had a statistically non-significant difference in F1 score, but Hypothesis Game performed significantly better in Detailed Recall of pathways (Friedman test, $\chi^2(3)=84.3, p<0.0001$, post-hoc Wilcoxon test with Bonferroni correction $p < 0.001$).
Figure 5: Aggregation of all results on the corruption task based on error type. Error bars show 95% confidence intervals.
...and 5 more figures
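The Figure 3 caption refers to a word-level normalised Levenshtein distance to the reference pathway. A minimal sketch of one such measure follows. The whitespace tokenisation and the normalisation by the longer sequence length are assumptions made for illustration; the paper may tokenise or normalise differently, and the function name `word_levenshtein` is hypothetical.

```python
def word_levenshtein(candidate: str, reference: str) -> float:
    """Edit distance over whitespace-separated words, scaled to [0, 1].

    Assumption: normalise by the length of the longer word sequence.
    """
    x, y = candidate.split(), reference.split()
    if not x and not y:
        return 0.0
    # prev[j] holds the distance between the words of x seen so far and y[:j].
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        curr = [i]
        for j, wy in enumerate(y, 1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1] / max(len(x), len(y))


# One substituted word out of three.
drift = word_levenshtein("A inhibits B", "A binds B")
```

A value near 0 means the edited hypothesis stays close to the reference wording, while values near 1 indicate an extensive rewrite.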
