WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent

TL;DR

WorldSense introduces a bias-controlled synthetic benchmark that probes whether large language models maintain tacit world models, using grounded inference, consistency detection, and completeness detection tasks built from descriptions of linear orders. The study finds that state-of-the-art chat models struggle to ground verbal descriptions and exhibit substantial response biases, with limited gains from chain-of-thought prompting or in-context learning. Finetuning on WorldSense data improves performance within the class of linear-order relations but generalizes only to a limited extent, and it does not yield broad improvements on external reasoning benchmarks. The authors release the WorldSense benchmark and datasets to spur further exploration of internal world representations in LLMs, and they outline directions for extending the evaluation to dynamic and planning tasks.

Abstract

We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts from the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constrained problem space.
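
To make the three problem types concrete, here is a minimal sketch of how their ground-truth answers could be computed by brute force over linear orders. It is an illustration under stated assumptions, not the authors' generator: the function names, the encoding of each relation as a pair `(a, b)` read as "a is left of b", and the restriction to binary relations are all hypothetical choices made for this example.

```python
from itertools import permutations


def world_states(entities, description):
    """Return every left-to-right arrangement consistent with the description.

    Assumption: `description` is a list of (a, b) pairs, each read as
    "a is left of b". Only binary relations are handled in this sketch.
    """
    states = []
    for order in permutations(entities):
        position = {name: i for i, name in enumerate(order)}
        if all(position[a] < position[b] for a, b in description):
            states.append(order)
    return states


def holds(state, query):
    """Truth value of the query (a, b), "a is left of b", in one world state."""
    a, b = query
    return state.index(a) < state.index(b)


def consistency(entities, description):
    """POSSIBLE if at least one world state satisfies the description."""
    return "POSSIBLE" if world_states(entities, description) else "IMPOSSIBLE"


def completeness(entities, description, query):
    """KNOWN if the query has the same truth value in every world state.

    An inconsistent description yields no states and is reported as UNKNOWN.
    """
    values = {holds(s, query) for s in world_states(entities, description)}
    return "KNOWN" if len(values) == 1 else "UNKNOWN"


def inference(entities, description, query):
    """TRUE or FALSE when the description determines the query, else None."""
    values = {holds(s, query) for s in world_states(entities, description)}
    if values == {True}:
        return "TRUE"
    if values == {False}:
        return "FALSE"
    return None
```

For three entities described by "A is left of B" and "B is left of C", the query "A is left of C" is TRUE; adding "C is left of A" makes the description IMPOSSIBLE; and replacing the second relation with "A is left of C" leaves the order of B and C UNKNOWN. Enumerating permutations is only practical for the small entity counts discussed here.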

Paper Structure (38 sections, 6 figures, 11 tables)

This paper contains 38 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: The three problem types of WorldSense. a. Grounded inferences test a model's ability to generate a world state from a verbal description and to inspect that world state to answer queries about it. For a language model, difficulties may arise in translating from text to world states or back, or in maintaining coherent world states. b. Consistency Detection consists in detecting whether the verbal description contains a contradiction, so that no possible world state can be generated from it. c. Completeness Detection consists in detecting whether the description admits several alternative world states that give rise to conflicting responses to the query.
  • Figure 2: Generation of WorldSense completeness problems. Left: Complete problem with 4 entities. Right: Incomplete problem derived from the complete problem. Descriptions are lists of verbalised binary or ternary relations. The semantic graph represents the underlying total or partial order. World state(s) are represented as objects in a 1D left-to-right arrangement. Queries are verbalised relations whose truth values are either determinate (white) or indeterminate (red). The incomplete problem is generated from the complete problem by first randomly picking a relation from the description, say $2<3$, then randomly picking either its left entity '2' or its right entity '3' and replacing it with its direct neighbour, respectively '1' or '4'. The new relation becomes $1<3$ or $2<4$, respectively.
  • Figure 3: Main WorldSense results across problem types. Left: Accuracy of three chat models split out by problem type (inference, consistency, completeness) and condition (trivial, normal). The horizontal black line indicates chance level. Right: Response bias of models split out by problem type and condition. +1 indicates a positive bias (TRUE, POSSIBLE, KNOWN), -1 a negative bias (FALSE, IMPOSSIBLE, UNKNOWN), 0 indicates no response bias. For both plots, error bars denote 95% confidence intervals.
  • Figure 4: Prompting enhancement results. Left: Accuracy across different prompting strategies, averaged over problem types and conditions. Right: Average response bias amplitude (absolute value) across the 6 problem types. The black line indicates the values corresponding to the basic prompting setup. Chance levels for accuracies are 50%, error bars denote 95% confidence intervals for both plots.
  • Figure 5: Finetuning results. Left: Accuracy across Llama2$_{70B}$ models finetuned on 0, 100K and 1M training examples, on the WorldSense (WS) test set split into in-domain (WS-ind) and out-of-domain (WS-ood) subsets, plus a memorisation test set (Mem.), and the ood-size, ood-query and ood-problem generalisation test sets. Right: Bias amplitude on the memorisation, WS and length generalisation test sets. Chance levels for accuracies are 50%, error bars denote 95% confidence intervals for both plots.
  • ...and 1 more figures
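
The procedure described in the Figure 2 caption for deriving an incomplete problem can be sketched as follows. This is a hedged illustration only: the function names, the integer encoding of entities as 1..n, and the rule of skipping a replacement when the endpoint has no outward neighbour (for example, the left entity of the relation starting at 1) are assumptions, since the caption does not say how boundary cases are handled.

```python
import random


def complete_description(n):
    """Chain of adjacent relations (i, i+1) describing the total order 1 < 2 < ... < n."""
    return [(i, i + 1) for i in range(1, n)]


def make_incomplete(description, n, rng):
    """Weaken one relation by replacing an endpoint with its direct neighbour.

    Follows the caption: pick a relation at random, then pick its left or
    right entity and move it one step outward (left entity to its left
    neighbour, right entity to its right neighbour).
    Assumption: a side is only eligible if that neighbour exists.
    """
    if n < 3:
        raise ValueError("need at least 3 entities to weaken a relation")
    idx = rng.randrange(len(description))
    left, right = description[idx]
    options = []
    if left > 1:
        options.append((left - 1, right))   # e.g. 2<3 becomes 1<3
    if right < n:
        options.append((left, right + 1))   # e.g. 2<3 becomes 2<4
    result = list(description)
    result[idx] = rng.choice(options)
    return result


# Usage: derive an incomplete 4-entity problem from the complete chain.
rng = random.Random(0)
full = complete_description(4)
weak = make_incomplete(full, 4, rng)
print(full, "->", weak)
```

With the chain 1<2, 2<3, 3<4, turning 2<3 into 1<3 leaves the relative order of entities 2 and 3 unspecified, which is what makes a query about that pair indeterminate.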