Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2

Gregor Betz, Kyle Richardson, Christian Voigt

TL;DR

This paper investigates whether dynamic problem elaboration (context expansions generated by the model itself) can improve zero-shot reasoning in GPT-2 on a deductive, multi-hop task. It introduces ChainRuler, a synthetic dataset with facts, rule chains, and distractors, and compares multiple elaboration strategies (Free, Fewshot IC/PC/PCIC, Structured, Recursive, Piecemeal), all generated by the same model. The findings show that dynamic elaborations can de-bias the model's simple heuristic and increase accuracy by up to 9 percentage points. The largest gains (up to 24 percentage points) occur when elaborations are highly faithful to the problem, and both elaboration type and coherence strongly influence effectiveness. The results indicate that context generation is central to the model's reasoning performance and offer a path toward meta-cognitive thinking in language models without fine-tuning.

Abstract

Thinking aloud is an effective meta-cognitive strategy human reasoners apply to solve difficult problems. We suggest improving the reasoning ability of pre-trained neural language models in a similar way, namely by expanding a task's context with problem elaborations that are dynamically generated by the language model itself. Our main result is that dynamic problem elaboration significantly improves the zero-shot performance of GPT-2 in a deductive reasoning and natural language inference task: while the model uses a syntactic heuristic for predicting an answer, it is capable (to some degree) of generating reasoned additional context which facilitates the successful application of its heuristic. We explore different ways of generating elaborations, including few-shot learning, and find that their relative performance varies with the specific problem characteristics (such as problem difficulty). Moreover, the effectiveness of an elaboration can be explained in terms of the degree to which the elaboration semantically coheres with the corresponding problem. In particular, elaborations that are most faithful to the original problem description may boost accuracy by up to 24%.
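The two-step idea described in the abstract (first let the model generate an elaboration, then predict an answer from the expanded context) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `elaborate_and_predict`, `toy_generate`, `toy_score`, the elicitation prompt, and the example problem are all hypothetical. In a real setup, `generate` would sample a continuation from GPT-2 and `score` would return the model's conditional probability of an answer option given the text.

```python
import re
from typing import Callable, List, Sequence, Tuple


def elaborate_and_predict(
    context: str,
    options: Sequence[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    elicitation: str = " Does this mean that",  # hypothetical prompt, not from the paper
) -> Tuple[str, str]:
    """Return (elaboration, predicted_option) for one problem."""
    # Step 1: the model "thinks aloud" by continuing the problem description.
    elaboration = generate(context + elicitation)
    # Step 2: the generated elaboration is appended to the original context.
    expanded = f"{context} {elaboration}".strip()
    # Step 3: the answer option with the highest score given the expanded context wins.
    prediction = max(options, key=lambda option: score(expanded, option))
    return elaboration, prediction


def _words(text: str) -> List[str]:
    """Lowercase word tokens, punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())


def toy_generate(prompt: str) -> str:
    # Hypothetical stand-in for sampling a continuation from a language model.
    return "Fiona is nice, so Fiona is kind."


def toy_score(text: str, option: str) -> float:
    # Hypothetical stand-in for a conditional probability:
    # the fraction of the option's words that already occur in the text.
    seen = set(_words(text))
    option_words = _words(option)
    return sum(w in seen for w in option_words) / max(len(option_words), 1)


# Made-up example problem in the spirit of a rule-chaining task.
context = "Fiona is nice. If someone is nice, then they are kind."
options = ["Fiona is not kind.", "Fiona is rough.", "Fiona is kind."]
elaboration, prediction = elaborate_and_predict(context, options, toy_generate, toy_score)
print(prediction)
```

Passing `generate` and `score` as parameters keeps the sketch independent of any particular model library; the same function could wrap a real language model without changing the control flow.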

Paper Structure

This paper contains 17 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Dynamic problem elaboration of a reasoning task, illustrated by an example drawn from the ChainRuler dataset.
  • Figure 2: Illustration of the different methods for eliciting and generating problem elaborations studied in this paper.
  • Figure 3: Accuracy in ChainRuler tasks of given effective distraction and depth. Subplots (a): absolute accuracy without elaboration (baseline none). Subplots (b): relative accuracy gains for best-performing elaboration compared to baseline none. Subplots (c): name of best-performing elaboration.
  • Figure 4: Prediction score of the correct answer (conclusion) and total epistemic luck, classified according to underlying elaboration type. (a): Mean prediction score as a function of epistemic luck; baseline none drawn thick, colors as in (b). (b): Distribution of total luck per problem for different types of problem elaboration, with the mean increase relative to baseline none in brackets.
  • Figure 5: Accuracy in ChainRuler task for six types of elaborations as a function of (a) their verisimilitude, that is, the semantic similarity between generated elaboration and correct answer (conclusion); (b) their pertinence, that is, the semantic similarity between generated elaboration and sequence of possible answers; and (c) their faithfulness, that is, the semantic similarity between generated elaboration and context. Top row: without contraposition. Bottom row: with contraposition.
  • ...and 2 more figures
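
The three coherence measures named in the Figure 5 caption each compare the generated elaboration with a different part of the problem. The sketch below shows that structure only. The caption does not say how semantic similarity is computed, so `bow_cosine` (a bag-of-words cosine) is an assumed placeholder, and the function names and example texts are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, Sequence


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; an assumed stand-in for a semantic similarity model."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(count * vb[word] for word, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def elaboration_metrics(
    elaboration: str,
    context: str,
    options: Sequence[str],
    conclusion: str,
    similarity: Callable[[str, str], float] = bow_cosine,
) -> Dict[str, float]:
    """Compare one elaboration with the conclusion, the answer options, and the context."""
    return {
        # similarity between elaboration and correct answer (conclusion)
        "verisimilitude": similarity(elaboration, conclusion),
        # similarity between elaboration and the sequence of possible answers
        "pertinence": similarity(elaboration, " ".join(options)),
        # similarity between elaboration and the problem context
        "faithfulness": similarity(elaboration, context),
    }


# Made-up example texts, for illustration only.
metrics = elaboration_metrics(
    elaboration="Fiona is kind.",
    context="Fiona is nice. If someone is nice, then they are kind.",
    options=["Fiona is kind.", "Fiona is not kind.", "Fiona is rough."],
    conclusion="Fiona is kind.",
)
print(metrics)
```

The `similarity` parameter is left pluggable so that a sentence-embedding model could replace the word-overlap placeholder without touching the metric definitions.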