ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment

Anthony Hevia, Sanjana Chintalapati, Veronica Ka Wai Lai, Thanh Tam Nguyen, Wai-Tat Wong, Terry Klassen, Lucy Lu Wang

TL;DR

ROBOTO2 introduces an open-source, web-based platform that couples PDF preprocessing, retrieval-augmented LLM prompting, and human-in-the-loop review to support ROB2 risk-of-bias assessment of clinical trials. It provides a practical workflow for signaling-question answering, evidence retrieval, and domain/overall risk judgments, while releasing a 521-trial pediatric dataset with 8954 signaling questions and 1202 evidence passages. The authors benchmark four LLMs and analyze retrieval strategies, highlighting that current models remain conservative and struggle with fully automated ROB2 judgments, underscoring the continued need for human validation. The work delivers both a usable tool for researchers and a valuable benchmark for advancing LLM-assisted systematic reviews in pediatric research, with implications for reproducibility and efficiency in evidence synthesis.

Abstract

We present ROBOTO2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBOTO2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBOTO2 is publicly available at https://roboto2.vercel.app/, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages), annotated using both manual and LLM-assisted methods, serving as a benchmark and enabling future research. Using this dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis of current model capabilities and ongoing challenges in automating this critical aspect of systematic review.

Paper Structure

This paper contains 36 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: ROBoto2 system pipeline. Given a clinical trial PDF as input, ROBoto2 first preprocesses the document to extract and embed paragraphs. Then, a QA module iterates through all of the questions of the ROB2 assessment to identify evidence passages and prompt GPT3.5 to answer the question based on the retrieved evidence.
  • Figure 2: Screenshot of ROBoto2 assisting with a question from Domain 2. The user can modify the model-provided answer and explanation and rate reference paragraphs.
  • Figure 3: Flowchart for how answers to signaling questions contribute to a domain-level judgment for Domain 4 in the ROB2 tool. Reproduced from https://sites.google.com/site/riskofbiastool/welcome/rob-2-0-tool.
  • Figure 4: Stacked bar chart showing the aggregate true positive (TP) classifications versus false positive/negative (FP/FN) errors made by each model. FPs and FNs are each broken down into two classes, where class 1 (lighter color) errors are milder than class 2 (darker color) errors (e.g., confusing NI with N/PN or Y/PY is less severe than misclassifying N/PN as Y/PY or vice versa). Counts less than 3 have their numbers hidden for chart readability, and full counts are available in Table \ref{tab:model-comparison}.
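
The two error classes described in the Figure 4 caption can be illustrated with a minimal Python sketch. It rests on assumptions that do not come from the paper: Y and PY (and likewise N and PN) are treated as one group, as the caption's "Y/PY" and "N/PN" notation suggests; the helper names `error_severity` and `tally` are hypothetical; and the FP/FN split shown in the chart is not modeled here.

```python
from collections import Counter

# Assumption: answers are collapsed into the groups named in the caption.
GROUPS = {"Y": "Y/PY", "PY": "Y/PY", "N": "N/PN", "PN": "N/PN", "NI": "NI"}


def error_severity(gold: str, predicted: str) -> int:
    """Return 0 for a match, 1 for a milder error, 2 for a more severe error."""
    gold_group, pred_group = GROUPS[gold], GROUPS[predicted]
    if gold_group == pred_group:
        return 0  # same group: counted as correct
    if "NI" in (gold_group, pred_group):
        return 1  # class 1: NI confused with N/PN or Y/PY
    return 2  # class 2: N/PN confused with Y/PY, or vice versa


def tally(pairs):
    """Count matches and errors by class for (gold, predicted) answer pairs."""
    counts = Counter(error_severity(gold, pred) for gold, pred in pairs)
    return {"match": counts[0], "class_1": counts[1], "class_2": counts[2]}


# Hypothetical answer pairs, for illustration only (not data from the paper).
example = [("Y", "PY"), ("NI", "N"), ("PN", "Y"), ("N", "N")]
summary = tally(example)
```

Grouping answers before comparing them keeps the severity rule to two checks: a mismatch involving NI is the milder class, and any other mismatch flips the direction of the judgment.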