Large language models surpass human experts in predicting neuroscience results

Xiaoliang Luo; Akilles Rechardt; Guangzhi Sun; Kevin K. Nejad; Felipe Yáñez; Bati Yilmaz; Kangjoo Lee; Alexandra O. Cohen; Valentina Borghesani; Anton Pashkov; Daniele Marinazzo; Jonathan Nicholas; Alessandro Salatiello; Ilia Sucholutsky; Pasquale Minervini; Sepehr Razavi; Roberta Rocca; Elkhan Yusifov; Tereza Okalova; Nianlong Gu; Martin Ferianc; Mikail Khona; Kaustubh R. Patil; Pui-Shee Lee; Rui Mata; Nicholas E. Myers; Jennifer K Bizley; Sebastian Musslick; Isil Poyraz Bilgin; Guiomar Niso; Justin M. Ales; Michael Gaebler; N Apurva Ratan Murty; Leyla Loued-Khenissi; Anna Behler; Chloe M. Hall; Jessica Dafflon; Sherry Dongqi Bao; Bradley C. Love

TL;DR

BrainBench introduces a forward-looking benchmark to evaluate whether large language models can predict neuroscience results better than humans. The study shows general-purpose LLMs outperform human experts in predicting study outcomes, with BrainGPT further improving through LoRA-based fine-tuning on neuroscience literature. Confidence calibration and cross-abstract context integration underpin the LLM advantage, while memorization is not the primary driver of performance. The work suggests a transferable framework for knowledge-intensive discovery and highlights the potential for human–AI teaming to accelerate science, with BrainGPT made publicly available for broader use.

Abstract

Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. To evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs were confident in their predictions, they were more likely to be correct, which presages a future where humans and LLMs team together to make discoveries. Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavors.

Paper Structure (17 sections, 3 equations, 30 figures, 5 tables)

Performance breakdown by subfields and by participant types
Do judgments from LLMs and human experts align?
LLMs can integrate information across context
LLM performance is not driven by data memorization
GPT-4 test creation prompt
Accuracy
Confidence calibration
Performance correlation across LLMs
Integration analysis
LLM training data memorization analysis
Participants
Procedure
Exclusion criteria
Performance correlation between humans and LLMs
Training data
...and 2 more sections

Figures (30)

Figure 1: Backward-looking and Forward-looking evaluations. (A) Backward-looking benchmarks involve recalling factual information. For example, in the left panel, a student retrieves a fact about the Gettysburg Address that they learned during a history class. Existing benchmarks in scientific domains are in essence backward-looking as they emphasize retrieving accepted facts for question answering and reasoning tasks. (B) Forward-looking benchmarks involve predicting novel outcomes based on past data. Two forms of uncertainty, aleatoric (due to intrinsic randomness) and epistemic (due to lack of knowledge), may be present. For example, in the right panel, a table tennis fan predicts which player will win the next set based on their knowledge of the players, how they have played so far today, and so forth. Inherent random factors, such as a breeze affecting the ball's flight, will also be present.
Figure 2: BrainBench is a forward-looking benchmark for neuroscience. BrainBench evaluates test-takers' ability to predict neuroscience results. BrainBench's test cases were sourced from recent Journal of Neuroscience abstracts across five neuroscience domains: Behavioral/Cognitive, Systems/Circuits, Neurobiology of Disease, Cellular/Molecular, and Developmental/Plasticity/Repair. Test-takers chose between the original abstract and one altered to significantly change the result while maintaining coherency. Human experts and large language models (LLMs) were tasked with selecting the correct (i.e., original) version from the two options. Human experts made their choices and provided confidence and expertise ratings in an online study. LLMs were scored as choosing the abstract with the lower perplexity (i.e., the text passage that was less surprising to the model), and their confidence was proportional to the difference in perplexity between the two options.
Figure 3: Performance of human experts and large language models on BrainBench. (A) LLMs outperformed human experts on BrainBench. Smaller models are on par with larger models. Base versions of models outperformed chat and instruct versions, which were tuned to be conversational with humans. (B) The distribution of test cases across neuroscience subfields roughly mirrors the distribution of articles in the Journal of Neuroscience, with Behavioral/Cognitive overrepresented. The average performance of $15$ LLMs and human experts is shown. LLMs outperformed human experts in every subfield (see the figure on performance breakdown by subfield for full results). Error bars represent the standard deviation of accuracy around the mean. (C) The majority of participants were doctoral students, postdoctoral researchers, and faculty/academic staff. Error bars represent the standard error of the mean.
Figure 4: Accuracy and confidence are calibrated for human experts and large language models (LLMs). When human experts and LLMs are confident in their BrainBench judgments, they are more likely to be correct. Confidence ratings were sorted and placed in equally-sized bins with the mean accuracy for items in that bin plotted. The positive slope of the black regression lines for human experts and all LLMs indicates that confidence is well calibrated (i.e., higher confidence corresponds to higher accuracy). Calibration is beneficial for human-machine teams.
Figure 5: Fine-tuning a pretrained large language model (LLM) on neuroscience knowledge. Mistral-7B-v0.1 was fine-tuned using LoRA on neuroscience articles from 2002-2022 (a total of $1.3$ billion tokens). (A) The fine-tuned model improved by $3\%$ on BrainBench. (B) The fine-tuning process substantially shifted the perplexity distribution of correct responses, indicative of the LLM specializing in neuroscience.
...and 25 more figures
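
The captions for Figures 2 and 4 describe two procedures that can be sketched in code: scoring an LLM by picking the abstract with the lower perplexity, and checking calibration by sorting items by confidence into equally sized bins. The Python below is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`perplexity`, `choose_abstract`, `binned_accuracy`) are hypothetical, per-token log-probabilities are assumed to have been obtained from some language model beforehand, and confidence is taken to be the absolute perplexity difference (the caption only says it is proportional to that difference).

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_abstract(
    logprobs_a: Sequence[float], logprobs_b: Sequence[float]
) -> Tuple[int, float]:
    """Pick the less surprising abstract (0 for A, 1 for B).

    Confidence is assumed here to be the absolute difference in perplexity.
    """
    ppl_a = perplexity(logprobs_a)
    ppl_b = perplexity(logprobs_b)
    choice = 0 if ppl_a <= ppl_b else 1
    return choice, abs(ppl_a - ppl_b)


def binned_accuracy(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int
) -> List[float]:
    """Sort items by confidence, split into equal bins, return accuracy per bin."""
    if n_bins < 1 or n_bins > len(confidences):
        raise ValueError("n_bins must be between 1 and the number of items")
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    size = len(order) // n_bins
    bins = [order[k * size:(k + 1) * size] for k in range(n_bins)]
    # Any leftover items (when the count is not divisible) join the top bin.
    bins[-1].extend(order[n_bins * size:])
    return [sum(correct[i] for i in b) / len(b) for b in bins]
```

With this sketch, mean accuracy that rises from the lowest-confidence bin to the highest corresponds to the positive slope the Figure 4 caption describes.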
