Critical Thinking for Language Models

Gregor Betz, Christian Voigt, Kyle Richardson

TL;DR

Neural language models struggle with reasoning tasks. The paper responds by proposing a 'critical thinking curriculum' built on a synthetic corpus of deductively valid arguments. It shows that intermediary pre-training on three simple core argument schemes transfers to different and more complex argument types, and that it improves zero-shot performance on the GLUE diagnostics (by up to 15 percentage points) and on SNLI. The gains do not extend to every reasoning benchmark (e.g., ARC, LogiQA), which marks the limits of the current corpus and the need for a broader curriculum. Overall, the work provides a promising foundation for using synthetic, well-structured argumentative texts to seed reasoning abilities in language models and outlines concrete directions for expanding the curriculum.

Abstract

This paper takes a first step towards a critical thinking curriculum for neural auto-regressive language models. We introduce a synthetic corpus of deductively valid arguments, and generate artificial argumentative texts to train and evaluate GPT-2. Significant transfer learning effects can be observed: Training a model on three simple core schemes allows it to accurately complete conclusions of different, and more complex types of arguments, too. The language models generalize the core argument schemes in a correct way. Moreover, we obtain consistent and promising results for NLU benchmarks. In particular, pre-training on the argument schemes raises zero-shot accuracy on the GLUE diagnostics by up to 15 percentage points. The findings suggest that intermediary pre-training on texts that exemplify basic reasoning abilities (such as typically covered in critical thinking textbooks) might help language models to acquire a broad range of reasoning skills. The synthetic argumentative texts presented in this paper are a promising starting point for building such a "critical thinking curriculum for language models."

Paper Structure

This paper contains 20 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Syllogistic argument schemes used to create an artificial argument corpus.
  • Figure 2: Pipeline for creating natural language instances of argument schemes with multiple templating.
  • Figure 3: Accuracy of four model versions in three conclusion completion tasks and on different test datasets (out of sample, paraphrased, out of domain).
  • Figure 4: Accuracy of conclusion completions (three tasks) for instances of different argument schemes (see Figure 1) and four model versions.
  • Figure 5: Gains in accuracy due to fine-tuning on the artificial argument corpus (AAC), computed as accuracy of the TRAIN model minus accuracy of the BASE model, for differently sized models and different NLP benchmark tasks: the GLUE diagnostics data, the SNLI dataset, the argument reasoning comprehension (ARC) benchmark, and the LogiQA dataset.
  • ...and 2 more figures