Beyond Human-Like Processing: Large Language Models Perform Equivalently on Forward and Backward Scientific Text

Xiaoliang Luo, Michael Ramscar, Bradley C. Love

Abstract

The impressive performance of large language models (LLMs) has led to their consideration as models of human language processing. Instead, we suggest that the success of LLMs arises from the flexibility of the transformer learning architecture. To evaluate this conjecture, we trained LLMs on scientific texts that were either in a forward or backward format. Despite backward text being inconsistent with the structure of human languages, we found that LLMs performed equally well in either format on a neuroscience benchmark, eclipsing human expert performance for both forward and backward orders. Our results are consistent with the success of transformers across diverse domains, such as weather prediction and protein design. This widespread success is attributable to LLMs' ability to extract predictive patterns from any sufficiently structured input. Given their generality, we suggest caution in interpreting LLMs' success in linguistic tasks as evidence for human-like mechanisms.

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Forward and backward tokenization and training. Both forward and backward trained models were optimized to predict the next token in the training data sequence. (A) The forward tokenizer and models were trained on 20 years of neuroscience literature. (B) In contrast, the backward tokenizer and models were trained on the same data with text reversed at the character level.
  • Figure 2: BrainBench is a benchmark for neuroscience. (A) BrainBench evaluates test-takers' ability to predict neuroscience results. Test-takers chose between the original abstract and one altered to significantly change the result while maintaining coherency. (B) Human experts and large language models (LLMs) were tasked with selecting the correct (i.e., original) version from the two options. Human experts made choices and provided confidence and expertise ratings in an online study. LLMs were scored as choosing the abstract with the lower perplexity (i.e., the text passage that was less surprising to the model), and their confidence was proportional to the difference in perplexity between the two options. Figure adapted from luo_large_2024.
  • Figure 3: BrainBench performance of GPT-2 models trained forward and backward. GPT-2 models, trained from scratch on two decades of neuroscience literature, rival or exceed human expert performance, demarcated by the blue dashed line. Models trained on the same data reversed at the character level performed non-significantly better than their forward-trained counterparts.
  • Figure 4: Backward-trained models exhibit higher perplexities on both validation and BrainBench items. (A) Perplexity of the validation set items; (B) Perplexity of the correct options in BrainBench items.
  • Figure 5: Comparison of model and human judgments on BrainBench difficulty. Model judgments (both forward- and backward-trained) correlate more strongly with each other than with human expert judgments. Backward-trained models show significantly lower correlation with human judgments than forward-trained models do.
  • ...and 1 more figure
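
The captions for Figures 1 and 2 describe two mechanical steps that are easy to illustrate: reversing text at the character level, and scoring a two-option BrainBench item by choosing the option with the lower perplexity. The Python sketch below is only an illustration under stated assumptions, not the authors' code. The function names are hypothetical, the per-token log-probabilities are assumed to come from some already-trained model (none is loaded here), and confidence is taken to be the absolute perplexity difference, which is one simple reading of "proportional to the difference in perplexity".

```python
import math
from typing import Sequence, Tuple


def reverse_characters(text: str) -> str:
    """Reverse a passage character by character, as in Figure 1B.

    Note: slicing reverses Unicode code points, so combining marks would
    detach from their base characters; the paper's exact preprocessing is
    not specified in this section.
    """
    return text[::-1]


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood per token.

    `token_log_probs` holds natural-log probabilities, one per token,
    assumed to be produced by a language model.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def choose_abstract(
    log_probs_a: Sequence[float],
    log_probs_b: Sequence[float],
) -> Tuple[int, float]:
    """Pick the less surprising of two abstracts.

    Returns (choice, confidence): choice is 0 for option A and 1 for
    option B; confidence is the absolute perplexity difference
    (an assumed, unnormalized form).
    """
    ppl_a = perplexity(log_probs_a)
    ppl_b = perplexity(log_probs_b)
    choice = 0 if ppl_a <= ppl_b else 1
    confidence = abs(ppl_a - ppl_b)
    return choice, confidence
```

For a backward-trained model the same scoring would apply unchanged, with each abstract first passed through `reverse_characters` and then tokenized by the backward tokenizer before log-probabilities are computed.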