Evaluation of Clinical Trials Reporting Quality using Large Language Models
Mathieu Laï-king, Patrick Paroubek
TL;DR
This study tackles automated evaluation of clinical trial reporting quality by using large language models to assess CONSORT-abstract criteria. It adapts two English CONSORT-abstract corpora into a QA task and evaluates multiple prompting strategies, including Chain-of-Thought, across several transformer-based systems. The best result (Mixtral-8x22B with 5-shot-CoT) reaches about 85% micro-averaged accuracy, with explanations offering transparency into reasoning. The work demonstrates feasibility for automated abstract-quality assessment while highlighting data limitations, model adaptation questions, and the need for future expansion to other standards and full-text evaluation.
Abstract
Reporting quality is an important topic in clinical trial research articles, as it can impact clinical decisions. In this article, we test the ability of large language models to assess the reporting quality of such articles using the Consolidated Standards of Reporting Trials (CONSORT). We create CONSORT-QA, an evaluation corpus built from two studies of abstract reporting quality based on the CONSORT-abstract standards. We then evaluate how well different large generative language models (from the general domain or adapted to the biomedical domain) assess CONSORT criteria under several established prompting methods, including Chain-of-Thought. Our best combination of model and prompting method achieves 85% accuracy. Chain-of-Thought prompting also provides valuable information on the model's reasoning when completing the task.
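To make the reported metric concrete, the following is a minimal sketch of how a micro-averaged accuracy over criterion-level answers could be computed, alongside a per-criterion breakdown. It is not the authors' evaluation code. The criterion names, the yes/no label set, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def micro_accuracy(records):
    """Fraction of correct answers pooled over all (abstract, criterion) pairs.

    `records` is a sequence of (criterion, gold_label, predicted_label) tuples.
    """
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    correct = sum(1 for _, gold, pred in records if gold == pred)
    return correct / len(records)


def per_criterion_accuracy(records):
    """Accuracy computed separately for each criterion."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for criterion, gold, pred in records:
        totals[criterion] += 1
        hits[criterion] += int(gold == pred)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical toy data: criterion names and yes/no labels are assumptions.
records = [
    ("randomization", "yes", "yes"),
    ("randomization", "no", "no"),
    ("randomization", "no", "yes"),
    ("blinding", "yes", "yes"),
]

print(micro_accuracy(records))
print(per_criterion_accuracy(records))
```

Micro-averaging pools every question before dividing, so criteria with more annotated examples weigh more heavily in the final score than rare ones. The per-criterion view helps show whether a high overall figure hides weak performance on specific criteria.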
