Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

Shen Zhou Hong, Alex Kleinman, Alyssa Mathiowetz, Adam Howes, Julian Cohen, Suveer Ganta, Alex Letizia, Dora Liao, Deepika Pahari, Xavier Roberts-Gaal, Luca Righetti, Joe Torres

TL;DR

This study evaluates whether mid-2025 frontier LLMs improve novice performance in a physically executed reverse genetics workflow. In a preregistered, investigator-blinded RCT (n = 153) conducted in a BSL-2 lab, LLM access did not significantly increase the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759). The cell culture task showed the largest positive trend (68.8% vs. 55.3%; P = 0.059), and time-to-progress measurements favored the LLM arm. Post-hoc Bayesian pooling suggests a modest uplift of roughly 1.4-fold (95% CrI 0.74-2.62) for a typical reverse genetics task under LLM guidance, a credible interval that includes no effect; task-level comparisons remained underpowered because completion rates were low. The findings reveal a gap between in silico benchmarks and real-world lab performance, underscoring the need for physical-world validation and for improved interfaces or prompting strategies that better convey tacit knowledge to novices. Overall, LLM assistance may modestly accelerate procedural progression but does not substantially boost end-to-end completion within the study timeframe, which argues for careful, evidence-based assessment of AI-enabled biosecurity risks in practical settings.

Abstract

Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a "typical" reverse genetics task under LLM assistance. Ordinal regression modeling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.

Paper Structure (37 sections, 11 figures, 19 tables)


Figures (11)

  • Figure 1: Trial Design. Schematic of the 8-week in-person study. Participants (n = 153) completed safety and LLM training prior to randomization and the start of laboratory work (Session 1). Participants completed a workflow consisting of a foundational skill assessment (Pre-task 1) followed by the core reverse genetics sequence (Tasks 2--4) and RNA quantification (Task 5). Baseline surveys were collected at Session 5; outcome measures (task completion) and tool utilization (chat logs, search history) were recorded continuously throughout the 39 sessions.
  • Figure 2: (a) Baseline participant characteristics. Stacked bar charts display the distribution of education level, academic field, and prior biology experience across the full cohort. (b) CONSORT flow diagram illustrating participant allocation, attrition, and analysis sets. Of the 153 randomized participants (the Full Analysis Set, FAS), 128 (84%) met the attendance criteria ($\geq 35$ sessions) for inclusion in the Per-Protocol Set (PPS).
  • Figure 3: Task Success Rates and Pooled Effect Estimates. Forest plot displaying success rates expressed as Risk Ratios (RR = LLM/INT). Black markers represent observed RRs with 95% confidence intervals (CI) calculated using the Koopman score method; P values are derived from one-sided Fisher's exact tests. Purple markers represent posterior estimates from a hierarchical Bayesian logistic regression model, displaying posterior means, 95% credible intervals (CrI), and the posterior probability of a positive effect ($\Pr(RR > 1)$). Shaded regions depict full posterior densities. Out-of-sample: Posterior distribution of the predicted RR for a hypothetical, out-of-sample reverse genetics task. The vertical dashed line at $x = 1$ indicates no effect.
  • Figure 4: Time-to-Completion per Task. Kaplan-Meier cumulative incidence curves illustrating the probability of successful completion over the study duration for the indicated task or outcome. The Internet arm is shown in pink ($n = 76$) and the LLM arm in blue ($n = 77$). Shaded regions denote 95% confidence intervals. Vertical dashed lines indicate the scheduled release date for each specific task, prior to which no attempts were permitted. Pink and blue values indicate numbers at risk, per arm.
  • Figure 5: Progress through procedural steps is higher in the LLM arm. (a) Stepwise survival curves showing participant attrition at each procedural substep for Tasks 2--5. Solid lines represent observed completion rates for the Internet (pink) and LLM (blue) arms. (b) Bayesian ordinal regression estimates of LLM effects on progression. The Odds Ratio (OR) quantifies the increased likelihood of an LLM-arm participant reaching a more advanced procedural stage compared to an Internet-arm participant. Shaded regions depict full posterior densities; points and error bars indicate posterior means and 95% credible intervals (CrI).
  • ...and 6 more figures
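
The frequentist quantities named in the Figure 3 caption (a risk ratio RR = LLM/INT and a one-sided Fisher's exact test) can be illustrated with a minimal standard-library sketch. This is not the authors' analysis code. The success counts below are assumptions, back-calculated from the percentages reported in the abstract and the arm sizes in Figure 4 (77 LLM, 76 Internet); the paper's own tables should be treated as authoritative. The function names `risk_ratio` and `fisher_one_sided` are hypothetical, and the Koopman score confidence interval is not implemented here.

```python
from math import comb


def risk_ratio(success_a, n_a, success_b, n_b):
    """Ratio of success proportions, arm A over arm B."""
    return (success_a / n_a) / (success_b / n_b)


def fisher_one_sided(success_a, n_a, success_b, n_b):
    """One-sided Fisher's exact P value for 'arm A succeeds more often'.

    With the 2x2 table margins held fixed, the number of successes in
    arm A follows a hypergeometric distribution; the P value is the
    probability of observing at least `success_a` successes in arm A.
    """
    total_n = n_a + n_b
    total_success = success_a + success_b
    total_failure = total_n - total_success
    upper = min(n_a, total_success)
    favourable = sum(
        comb(total_success, k) * comb(total_failure, n_a - k)
        for k in range(success_a, upper + 1)
    )
    return favourable / comb(total_n, n_a)


# ASSUMED counts (LLM successes, LLM n, Internet successes, Internet n),
# inferred from the reported percentages; not taken from the paper's tables.
assumed_counts = {
    "cell culture": (53, 77, 42, 76),        # 68.8% vs. 55.3%
    "workflow completion": (4, 77, 5, 76),   # 5.2% vs. 6.6%
}

for task, counts in assumed_counts.items():
    rr = risk_ratio(*counts)
    p = fisher_one_sided(*counts)
    print(f"{task}: RR = {rr:.2f}, one-sided P = {p:.3f}")
```

Because the test is one-sided in the direction of an LLM benefit, an outcome where the LLM arm does slightly worse (as with workflow completion) yields a P value above 0.5 rather than one near 1 minus a two-sided value. If the inferred counts are correct, the printed P values should land close to those reported in the abstract.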