Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study
Zhichao He, Mouxiao Bian, Jianhong Zhu, Jiayuan Chen, Yunqiu Wang, Wenxia Zhao, Tianbin Li, Bing Han, Jie Xu, Junyan Wu
TL;DR
This study evaluates whether current large language models can reliably identify adherence to the CONSORT 2010 reporting guidelines in randomized controlled trials under a zero-shot setting. Using a gold-standard benchmark (RCTBench) of $150$ full-text RCTs annotated across $37$ CONSORT items, sixteen LLMs are assessed on their ability to classify items as Compliant, Non-Compliant, or Not Applicable, with outputs structured as non_compliant_items and not_applicable_items in a JSON format. The top performers achieve a macro-$F_1$ of about $0.63$ and a Cohen's kappa of roughly $0.28$, but the models show a stark dichotomy: high accuracy for identifying compliant reporting yet poor performance for detecting omissions or not-applicable items (often $F_1<0.40$), and some strong models (e.g., GPT-4o) underperform relative to expectations. The authors conclude that while LLMs may assist in rapid screening, they are not yet reliable enough to replace human expertise in critical appraisal; a human-in-the-loop workflow remains necessary to safeguard the integrity of clinical evidence.
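The JSON output described above lists only the exceptions, so every item absent from both lists is implicitly Compliant. A minimal sketch of that decoding step follows. The key names non_compliant_items and not_applicable_items come from the text; the item identifiers ("1a" through "25", following the CONSORT 2010 checklist numbering), the function name decode_response, and the label strings are illustrative assumptions, not the authors' actual schema or code.

```python
import json

# Assumption: items are identified by their CONSORT 2010 checklist numbers.
# The benchmark's real identifier format is not given in the text.
CONSORT_ITEMS = [
    "1a", "1b", "2a", "2b", "3a", "3b", "4a", "4b", "5", "6a", "6b",
    "7a", "7b", "8a", "8b", "9", "10", "11a", "11b", "12a", "12b",
    "13a", "13b", "14a", "14b", "15", "16", "17a", "17b", "18", "19",
    "20", "21", "22", "23", "24", "25",
]  # 37 items in total


def decode_response(raw: str) -> dict[str, str]:
    """Turn one model response into a label for each of the 37 items.

    Hypothetical helper: items missing from both lists default to
    "compliant", mirroring the exception-only output format.
    """
    payload = json.loads(raw)
    non_compliant = set(payload.get("non_compliant_items", []))
    not_applicable = set(payload.get("not_applicable_items", []))

    # An item cannot carry two labels at once.
    overlap = non_compliant & not_applicable
    if overlap:
        raise ValueError(f"items listed in both categories: {sorted(overlap)}")

    # Reject identifiers that are not part of the checklist.
    unknown = (non_compliant | not_applicable) - set(CONSORT_ITEMS)
    if unknown:
        raise ValueError(f"unknown item identifiers: {sorted(unknown)}")

    labels = {}
    for item in CONSORT_ITEMS:
        if item in non_compliant:
            labels[item] = "non_compliant"
        elif item in not_applicable:
            labels[item] = "not_applicable"
        else:
            labels[item] = "compliant"
    return labels


# Usage with a made-up response (not taken from the study).
example = '{"non_compliant_items": ["8a", "9"], "not_applicable_items": ["7b"]}'
decoded = decode_response(example)
```

One consequence of this format is that a model returning two empty lists is scored as labelling all 37 items Compliant, so omissions by the model and genuine compliance are indistinguishable at this step.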
Abstract
The Consolidated Standards of Reporting Trials (CONSORT) statement is the global benchmark for transparent, high-quality reporting of randomized controlled trials (RCTs). Manual verification of CONSORT adherence is laborious and time-intensive, and it constitutes a significant bottleneck in peer review and evidence synthesis. This study systematically evaluated the accuracy and reliability of contemporary large language models (LLMs) in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a gold-standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1 score for the three-class classification task (Compliant, Non-Compliant, or Not Applicable), supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634, with Cohen's kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: most models identified compliant items with high accuracy (F1 score > 0.850) but struggled to identify non-compliant and not applicable items, for which F1 scores rarely exceeded 0.400. Notably, some high-profile models such as GPT-4o underperformed, achieving a macro F1 score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, as they capably identify well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.
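To make the two headline metrics concrete, the sketch below computes macro-averaged F1 and Cohen's kappa for a three-class task using their standard textbook definitions. It is an illustration under stated assumptions, not the study's evaluation code: the label strings, function names, and the toy data are invented for the example.

```python
from collections import Counter

# Assumption: illustrative label strings for the three classes.
LABELS = ("compliant", "non_compliant", "not_applicable")


def per_class_f1(gold, pred, label):
    """F1 for one class, treating it as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    # Convention chosen here: a class with no true or predicted
    # instances scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(per_class_f1(gold, pred, lab) for lab in labels) / len(labels)


def cohens_kappa(gold, pred):
    """Agreement between two label sequences, corrected for chance."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    expected = sum(
        gold_counts[lab] * pred_counts[lab]
        for lab in set(gold_counts) | set(pred_counts)
    ) / (n * n)
    if expected == 1.0:
        # Degenerate case: both sequences use one identical label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (not from the study): a model that labels everything compliant.
gold = ["compliant"] * 8 + ["non_compliant", "not_applicable"]
pred = ["compliant"] * 10

compliant_f1 = per_class_f1(gold, pred, "compliant")
overall_macro_f1 = macro_f1(gold, pred)
kappa = cohens_kappa(gold, pred)
```

In this toy case the compliant-class F1 is high (16/18) while macro F1 falls to 8/27 and kappa is zero, because the two minority classes are never predicted. This is the same qualitative pattern the abstract reports: strong performance on compliant items can coexist with modest macro F1 and only fair chance-corrected agreement.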
