
Loose LIPS Sink Ships: Asking Questions in Battleship with Language-Informed Program Sampling

Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum

TL;DR

This work investigates how people generate informative questions in a grounded setting by modeling question-asking as a resource-bounded Bayesian search. The authors introduce Language-Informed Program Sampling (LIPS), which uses large language models both as a prior over questions and as a means of translating natural language into a Language of Thought (LoT) represented by a domain-specific probabilistic program. LIPS samples $k$ candidate questions, computes the Expected Information Gain (EIG) of each, and selects the most informative one, which amounts to a simple Monte Carlo search. With modest internal computation (small $k$), LIPS matches or exceeds mean human informativity in Battleship contexts. Pure LLM baselines, by contrast, struggle to ground questions in the board state, and even GPT-4V fails to leverage visual information for grounding. The results show how Bayesian models of cognition can leverage language statistics to capture human priors, and they expose significant grounding limitations of pure LLMs, with implications for designing information-gathering AI that can reason about uncertainty in real-world tasks.

Abstract

Questions combine our mastery of language with our remarkable facility for reasoning about uncertainty. How do people navigate vast hypothesis spaces to pose informative questions given limited cognitive resources? We study these tradeoffs in a classic grounded question-asking task based on the board game Battleship. Our language-informed program sampling (LIPS) model uses large language models (LLMs) to generate natural language questions, translate them into symbolic programs, and evaluate their expected information gain. We find that with a surprisingly modest resource budget, this simple Monte Carlo optimization strategy yields informative questions that mirror human performance across varied Battleship board scenarios. In contrast, LLM-only baselines struggle to ground questions in the board state; notably, GPT-4V provides no improvement over non-visual baselines. Our results illustrate how Bayesian models of question-asking can leverage the statistics of language to capture human priors, while highlighting some shortcomings of pure LLMs as grounded reasoners.
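The sample-translate-evaluate loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the board encoding, the hand-written question/program pairs (standing in for LLM samples and LLM translations), and the names `expected_information_gain` and `best_of_k` are all assumptions made for illustration. It also assumes a uniform prior over boards and noiseless answers, in which case EIG reduces to the entropy of the answer distribution.

```python
import itertools
import math
import random
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

# Hypothetical toy encoding: a board is (red_ship_length, blue_ship_length).
Board = Tuple[int, int]
Program = Callable[[Board], Hashable]
Question = Tuple[str, Program]


def expected_information_gain(program: Program, boards: Sequence[Board]) -> float:
    """EIG in bits, assuming a uniform prior over boards and noiseless answers.

    Under those assumptions, EIG equals the entropy of the answer distribution.
    """
    counts = Counter(program(board) for board in boards)
    n = len(boards)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def best_of_k(candidates: Sequence[Question], boards: Sequence[Board],
              k: int, rng: random.Random) -> Question:
    """Sample k candidate questions and return the one with the highest EIG."""
    sampled = rng.sample(list(candidates), k)  # stand-in for k LLM samples
    return max(sampled, key=lambda q: expected_information_gain(q[1], boards))


# Hypothesis space: boards consistent with a made-up observation that the
# red ship spans at least 3 tiles.
all_boards: List[Board] = list(itertools.product([2, 3, 4], repeat=2))
boards = [b for b in all_boards if b[0] >= 3]

# Hand-written (question, program) pairs standing in for LLM output.
candidates: List[Question] = [
    ("Is the red ship longer than 2 tiles?", lambda b: b[0] > 2),
    ("Is the red ship 3 tiles long?", lambda b: b[0] == 3),
    ("Is the blue ship longer than 3 tiles?", lambda b: b[1] > 3),
    ("How long is the blue ship?", lambda b: b[1]),
]

for text, program in candidates:
    print(f"{expected_information_gain(program, boards):.3f}  {text}")

best_text, _ = best_of_k(candidates, boards, k=len(candidates), rng=random.Random(0))
print("best:", best_text)
```

In this toy setting the first question scores 0 bits, because every board still in the hypothesis space gives the same answer. This is the kind of redundant sample that an EIG filter discards.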

Paper Structure

This paper contains 32 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: How do people formulate information-seeking questions in a grounded task such as the game Battleship? Given a partially-revealed board, our LIPS model (A) samples $k$ questions from a language model prior and (B) translates these into LoT programs. (C) The utility of a question is computed by simulating the program against a hypothesis space of boards compatible with the observation. Here, the best question achieves an Expected Information Gain (EIG) of 0.99, meaning the answer would rule out nearly half the boards in the hypothesis space. Our model is well-suited to filtering out samples from a noisy prior that are redundant (e.g., "Is the red ship longer than 2 tiles?") or inconsistent due to lack of grounding.
  • Figure 2: We experiment with 3 different board representations: an ASCII-style grid, a textual serialization, and a visual prompt encoded as an image.
  • Figure 3: Comparing the informativity of model-generated questions against human data. (Top left) LIPS with two LLMs and a hand-engineered grammar as proposal distributions over questions. As $k$ increases, all three models reach mean-human performance, though they fall short of the best human-generated questions. (Top right) Evaluating GPT-4's performance with different prompt formats and board representations. Including few-shot examples universally boosts EIG. However, performance varies depending on the board format. Notably, GPT-4(V) was unable to utilize the board's structure in text (grid) or images (visual), implying a failure of grounding. (Bottom) Q-Q plots comparing model vs. human EIG values at varying sample sizes. At $k=5$, all three models are generally well-calibrated to humans, though they fall short of the top 10-20% of human questions. Throughout, error bars and shaded regions indicate 95% bootstrapped confidence intervals. GPT-4 and CodeLlama-7b refer to the few-shot, textual condition unless otherwise noted.
  • Figure 4: Proportion of top-level question types generated by each proposal distribution at $k=1$.
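
The link in the Figure 1 caption between an EIG of 0.99 and ruling out "nearly half the boards" can be made explicit with a short worked equation. The notation is an assumption for illustration, not necessarily the paper's: $S$ is the hypothesis space of boards, $q$ a question, $A$ its answer, and EIG is taken to be measured in bits under a uniform prior over $S$ with noiseless answers.

```latex
% Assumed setting: uniform prior over boards s in S, deterministic answers.
\begin{align}
  \mathrm{EIG}(q) &= H(S) - \mathbb{E}_{a \sim p(A \mid q)}\big[ H(S \mid A = a) \big] \\
                  &= H(A \mid q) = -\sum_{a} p(a \mid q) \log_2 p(a \mid q).
\end{align}
% For a yes/no question where a fraction p of the boards answers "yes":
\begin{equation}
  \mathrm{EIG}(q) = -p \log_2 p - (1 - p) \log_2 (1 - p).
\end{equation}
% p = 0.5 gives exactly 1 bit; p = 0.44 gives approximately 0.99 bits.
```

Under these assumptions, a yes/no question can yield at most 1 bit, which it does when the boards split evenly. A value of 0.99 corresponds to a split of roughly 44% to 56%, so either answer eliminates close to half of the remaining boards.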