Automated Risk-of-Bias Assessment of Randomized Controlled Trials: A First Look at a GEPA-trained Programmatic Prompting Framework

Lingbo Li, Anuradha Mathrani, Teo Susnjak

TL;DR

The paper targets the bottleneck of risk-of-bias assessment in randomized trials by introducing a programmatic prompting framework, built on DSPy and its GEPA module, to automate Cochrane RoB1 judgments. It reports that GEPA-generated prompts, optimized via Pareto-guided search, produce transparent, reproducible reasoning traces and often outperform manually crafted prompts, while remaining competitive with commercial LLMs. Evaluated on 100 RCTs across seven RoB domains, the method achieves higher accuracy in several key domains and offers substantial efficiency gains, though domain-specific variability and the need for human oversight persist. The work argues for a scalable, auditable path to integrating LLMs into evidence synthesis, with broader implications for standardization and reproducibility in systematic reviews.

Abstract

Assessing risk of bias (RoB) in randomized controlled trials is essential for trustworthy evidence synthesis, but the process is resource-intensive and prone to variability across reviewers. Large language models (LLMs) offer a route to automation, but existing methods rely on manually engineered prompts that are difficult to reproduce, generalize, or evaluate. This study introduces a programmable RoB assessment pipeline that replaces ad-hoc prompt design with structured, code-based optimization using DSPy and its GEPA module. GEPA refines LLM reasoning through Pareto-guided search and produces inspectable execution traces, enabling transparent replication of every step in the optimization process. We evaluated the method on 100 RCTs from published meta-analyses across seven RoB domains. GEPA-generated prompts were applied to both open-weight models (Mistral Small 3.1 with GPT-oss-20b) and commercial models (GPT-5 Nano and GPT-5 Mini). In domains with clearer methodological reporting, such as Random Sequence Generation, GEPA-generated prompts performed best, with similar results for Allocation Concealment and Blinding of Participants, while the commercial model performed slightly better overall. We also compared GEPA with three manually designed prompts using Claude 3.5 Sonnet. GEPA achieved the highest overall accuracy, improved performance by 30%-40% in Random Sequence Generation and Selective Reporting, and performed comparably to the manual prompts in the other domains. These findings suggest that GEPA can produce consistent and reproducible prompts for RoB assessment, supporting the structured and principled use of LLMs in evidence synthesis.
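
The abstract describes Pareto-guided search without spelling out what it means to keep a candidate on the Pareto front. The sketch below is a simplified illustration of that idea only: it filters candidate prompts by Pareto dominance over per-domain accuracy. It is not GEPA's actual selection procedure or the authors' code. The function names, prompt names, and accuracy values are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List

# Per-domain accuracy for one candidate prompt (domain name -> accuracy).
Scores = Dict[str, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if `a` is at least as good as `b` in every domain
    and strictly better in at least one."""
    at_least_as_good = all(a[d] >= b[d] for d in a)
    strictly_better = any(a[d] > b[d] for d in a)
    return at_least_as_good and strictly_better


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical accuracies, NOT results from the paper.
candidates = {
    "prompt_a": {"random_sequence_generation": 0.80, "allocation_concealment": 0.60},
    "prompt_b": {"random_sequence_generation": 0.70, "allocation_concealment": 0.75},
    "prompt_c": {"random_sequence_generation": 0.65, "allocation_concealment": 0.55},
}

front = pareto_front(candidates)
print(front)
```

In this toy setup, a candidate that is worse than another in every domain is discarded, while candidates that trade off strength in one domain against another are both retained for further refinement.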

Paper Structure

This paper contains 37 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: GEPA-based automated RoB assessment workflow
  • Figure 2: Performance comparison of GEPA-based and manually designed prompts. Radar plots (A) show domain-level mean accuracy across all domains, and violin plots (B) summarize overall correct assessment rates