Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning

Wasu Top Piriyakulkij, Cassidy Langenfeld, Tuan Anh Le, Kevin Ellis

TL;DR

This work gives a model of how to infer natural language rules by doing experiments, and compares with recent algorithms for using LLMs to generate and revise hypotheses, finding that the online inference method yields higher accuracy at recovering the true underlying rule, and provides better support for designing optimal experiments.

Abstract

We give a model of how to infer natural language rules by doing experiments. The model integrates Large Language Models (LLMs) with Monte Carlo algorithms for probabilistic inference, interleaving online belief updates with experiment design under information-theoretic criteria. We conduct a human-model comparison on a Zendo-style task, finding that a critical ingredient for modeling the human data is to assume that humans also consider fuzzy, probabilistic rules, in addition to assuming that humans perform approximately-Bayesian belief updates. We also compare with recent algorithms for using LLMs to generate and revise hypotheses, finding that our online inference method yields higher accuracy at recovering the true underlying rule, and provides better support for designing optimal experiments.
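The abstract mentions experiment design under information-theoretic criteria but does not spell out the criterion. As a minimal sketch, one common choice is expected information gain: pick the experiment whose outcome is expected to reduce uncertainty over the current hypotheses the most. Everything below is an assumption made for illustration: the toy rule family ("the scene has at least k blocks"), the noise rate `EPS` standing in for fuzzy rules, and all function names.

```python
import math

EPS = 0.1  # assumed noise rate: each rule's prediction is wrong with probability EPS


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def p_positive(k, n_blocks):
    """P(outcome is positive | hypothetical rule 'at least k blocks', scene)."""
    return 1 - EPS if n_blocks >= k else EPS


def expected_information_gain(n_blocks, hypotheses, weights):
    """Mutual information between the outcome and the hypothesis for one scene."""
    total = sum(weights)
    probs = [w / total for w in weights]
    # Entropy of the outcome after averaging over hypotheses.
    marginal = sum(q * p_positive(k, n_blocks) for q, k in zip(probs, hypotheses))
    # Average entropy of the outcome when the hypothesis is known.
    conditional = sum(
        q * binary_entropy(p_positive(k, n_blocks)) for q, k in zip(probs, hypotheses)
    )
    return binary_entropy(marginal) - conditional


hypotheses = [2, 3, 4, 5]          # candidate thresholds k
weights = [1.0, 1.0, 1.0, 1.0]     # equal belief in each hypothesis
candidates = [1, 2, 3, 4, 5, 6]    # candidate scenes, described by block count
best = max(candidates, key=lambda n: expected_information_gain(n, hypotheses, weights))
print(best)
```

In this toy setting the chosen scene is the one that splits the hypotheses evenly, since scenes on which all hypotheses agree carry no information.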

Paper Structure (39 sections, 10 equations, 13 figures, 5 tables)


Figures (13)

  • Figure 1: Alternation of experimentation and hypothesis generation on a simplified version of our ActiveACRE domain. Hypotheses characterize what causes the machine to activate (make noise).
  • Figure 2: The Sequential Monte Carlo method tracks a small number of hypotheses (called particles), each of which is a natural language rule, represented above by circles. After each experiment, the particles are revised in light of the new data by pushing them through the forward kernel. The new particles are then reweighted according to how well each explains the data seen so far. Resampling prunes low-probability hypotheses while multiplying high-probability ones.
  • Figure 3: (a) Example Zendo scene and its serialization into text. (b) Eight experiments, each of which is a scene, with a binary outcome (whether the scene makes stars come out of it). (c) Test scenes that evaluate whether a model or human has correctly inferred the hidden rule.
  • Figure 4: Human vs. model accuracy binned by 4 rule-following (RF) and 4 not rule-following (Not RF) test scenes. (a) Each point is an RF or Not RF accuracy for one of the 10 rules. (b) Rows are methods and columns are rules. Online inference with fuzzy rules (last row) most closely matches humans.
  • Figure 5: Comparing human and model predictions on each test scene after 7 rounds of experimentation; see also \ref{tab:logl}. Each point is a prediction on a test scene. Only the LLM, the best batch model, and the best online model are shown here; see \ref{fig:r2_lower} for all methods.
  • ...and 8 more figures
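
The revise, reweight, resample cycle described in the Figure 2 caption can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: the hypotheses are toy thresholds ("at least k blocks") instead of natural language rules, `revise` is a hypothetical stand-in for the LLM-based forward kernel, `EPS` is an assumed noise rate, and the weights use the plain likelihood of all data so far without the importance-weight correction a full Sequential Monte Carlo sampler would apply.

```python
import random

EPS = 0.1  # assumed noise rate, so every rule is "fuzzy" and no weight is zero


def likelihood(k, data):
    """P(data | hypothetical rule 'scene has at least k blocks') with noise EPS."""
    p = 1.0
    for n_blocks, label in data:
        predicted = n_blocks >= k
        p *= (1 - EPS) if predicted == label else EPS
    return p


def revise(k, data, rng):
    """Toy forward kernel: nudge the threshold if the newest result is mispredicted."""
    n_blocks, label = data[-1]
    if (n_blocks >= k) == label:
        return k
    return max(1, k + rng.choice([-1, 1]))


def smc_step(particles, data, rng):
    # 1. Revise each particle in light of the newest experiment.
    particles = [revise(k, data, rng) for k in particles]
    # 2. Reweight by how well each particle explains all data seen so far.
    weights = [likelihood(k, data) for k in particles]
    # 3. Resample: likely particles are copied, unlikely ones tend to be dropped.
    return rng.choices(particles, weights=weights, k=len(particles))


rng = random.Random(0)
particles = [1, 2, 3, 4, 5, 6]
data = []
# Toy experiments as (block count, outcome) pairs, consistent with "at least 3 blocks".
for experiment in [(2, False), (5, True), (3, True), (4, True)]:
    data.append(experiment)
    particles = smc_step(particles, data, rng)
print(sorted(particles))
```

The number of particles stays fixed across rounds, which is what keeps the tracked hypothesis set small.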