Matching domain experts by training from scratch on domain knowledge

Xiaoliang Luo, Guangzhi Sun, Bradley C. Love

TL;DR

The paper investigates whether expert-level predictions in neuroscience can emerge from domain-specific auto-regressive training on text, rather than broad emergent reasoning. It evaluates several GPT-2 variants on BrainBench, a forward-looking benchmark built from neuroscience abstracts, using perplexity-based choices between original and altered texts. The results show that a 124M GPT-2 finetuned on neuroscience data and a scratch-trained 124M with a neuro-specific tokenizer can match human experts, and a 774M pretrained model can even surpass them, with tokenization and contextual integration playing key roles. The findings imply that targeted, domain-focused training and tokenization can yield high-performance, human-level predictions with relatively small models, offering efficient paths for domain-specific AI and insights into how scientific knowledge is captured by language models.
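The perplexity-based choice mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes per-token natural-log probabilities have already been obtained from some language model, and the function names `perplexity` and `choose_version` are hypothetical. Using the absolute perplexity difference as a confidence score is also an assumption; this summary does not say how confidence is defined.

```python
import math
from typing import Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean per-token log-probability (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_version(
    logprobs_a: Sequence[float], logprobs_b: Sequence[float]
) -> Tuple[int, float]:
    """Pick the abstract version the model finds less surprising.

    Returns (index of the chosen version, confidence proxy).
    The confidence proxy (absolute perplexity difference) is an assumption
    made for this sketch, not something stated in the summary.
    """
    ppl_a = perplexity(logprobs_a)
    ppl_b = perplexity(logprobs_b)
    choice = 0 if ppl_a <= ppl_b else 1
    confidence = abs(ppl_a - ppl_b)
    return choice, confidence
```

In this sketch a test item counts as correct when the chosen index points at the original abstract rather than the altered one.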

Abstract

Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained (next word prediction) a relatively small 124M-parameter GPT-2 model on 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded when they were trained from scratch using a tokenizer specifically trained on neuroscience text or when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.

Paper Structure (14 sections, 1 equation, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Performance of human experts and models on BrainBench. Two configurations of GPT-2 models achieve human-level performance on BrainBench: one by fine-tuning the pretrained GPT-2 on neuroscience literature, and the other by using a new tokenizer (GPT2-Neuro) trained on neuroscience literature and retraining GPT-2 from scratch with only neuroscience data. Versions of GPT-2 that are untrained, pretrained, or trained solely on neuroscience data without these modifications underperform compared to experts on BrainBench. Larger GPT-2 (774M) trained using neuroscience literature surpassed human-level performance.
  • Figure 2: Token Analysis. (A) The pretrained GPT-2 tokenizer and the neuro-tokenizer share 47.9% of their vocabularies. (B-C) Of the vocabularies from the two tokenizers, 12.0% of the tokens from the pretrained tokenizer are commonly associated with neuroscience according to GPT-4, compared to 25.4% of the tokens from the neuro-tokenizer.
  • Figure 3: Tokenization examples. Compared to a pretrained tokenizer trained on general text, a neuro-tokenizer trained specifically on neuroscience literature better preserves domain-specific terminology, such as brain regions, in neuroscience.
  • Figure 4: Accuracy and confidence are calibrated for human experts and GPT-2 variants pretrained on neuroscience literature. When human experts and models are confident in their BrainBench judgments, they are more likely to be correct. Confidence ratings were sorted and placed in equally-sized bins with the mean accuracy for items in that bin plotted. The positive slope of the black regression lines for human experts and all models indicates that confidence is well calibrated (i.e., higher confidence corresponds to higher accuracy). Calibration is beneficial for human-machine teams.
  • Figure 5: GPT-2 variants pretrained on neuroscience literature integrate contextual information to succeed on BrainBench. Removing the background and method sections from abstracts, so that evaluation is based solely on individual sentences and result alterations, significantly impairs model performance on BrainBench. Much of the human-level performance appears to arise from integrating information across the abstract.
  • ...and 1 more figure
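
The binning procedure described in the Figure 4 caption (sort confidence ratings, place them in equally-sized bins, take the mean accuracy per bin) could look roughly like the sketch below. The names `binned_accuracy` and `slope_over_bins` are hypothetical, and fitting the regression line over the bin index is an assumption made for illustration; the caption does not state which variable the regression is fitted against.

```python
from statistics import mean
from typing import List, Sequence


def binned_accuracy(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int
) -> List[float]:
    """Sort items by confidence, split into near-equal bins, return mean accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n_bins <= 0 or n_bins > n:
        raise ValueError("n_bins must be between 1 and the number of items")
    # Item indices ordered from least to most confident.
    order = sorted(range(n), key=lambda i: confidences[i])
    accuracies = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        accuracies.append(mean(1.0 if correct[i] else 0.0 for i in order[start:end]))
    return accuracies


def slope_over_bins(bin_accuracies: Sequence[float]) -> float:
    """Least-squares slope of accuracy against bin index (assumed x-axis for this sketch)."""
    if len(bin_accuracies) < 2:
        raise ValueError("need at least two bins to fit a slope")
    xs = list(range(len(bin_accuracies)))
    mx, my = mean(xs), mean(bin_accuracies)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, bin_accuracies))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

A positive slope here corresponds to the caption's reading of calibration: higher-confidence bins tend to have higher accuracy.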