Evaluation of Clinical Trials Reporting Quality using Large Language Models

Mathieu Laï-king, Patrick Paroubek

TL;DR

This study tackles automated evaluation of clinical trial reporting quality by using large language models to assess CONSORT-abstract criteria. It adapts two English CONSORT-abstract corpora into a QA task and evaluates multiple prompting strategies, including Chain-of-Thought, across several transformer-based systems. The best result (Mixtral-8x22B with 5-shot-CoT) reaches about 85% micro-averaged accuracy, with explanations offering transparency into reasoning. The work demonstrates feasibility for automated abstract-quality assessment while highlighting data limitations, model adaptation questions, and the need for future expansion to other standards and full-text evaluation.

Abstract

Reporting quality is an important topic in clinical trial research articles, as it can impact clinical decisions. In this article, we test the ability of large language models to assess the reporting quality of this type of article using the Consolidated Standards of Reporting Trials (CONSORT). We create CONSORT-QA, an evaluation corpus from two studies on abstract reporting quality with CONSORT-abstract standards. We then evaluate the ability of different large generative language models (from the general domain or adapted to the biomedical domain) to correctly assess CONSORT criteria with different known prompting methods, including Chain-of-thought. Our best combination of model and prompting method achieves 85% accuracy. Using Chain-of-thought adds valuable information on the model's reasoning for completing the task.
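
The TL;DR reports the best result as micro-averaged accuracy. As a minimal sketch of what that metric means, assuming micro-averaging here pools every (abstract, criterion) judgment before computing accuracy, the snippet below contrasts it with a per-criterion macro average. The function names, criterion names, and labels are invented for illustration and are not data or code from the paper.

```python
from collections import defaultdict


def micro_accuracy(records):
    """Accuracy over all pooled judgments.

    records: iterable of (criterion, gold_label, predicted_label).
    """
    records = list(records)
    correct = sum(1 for _, gold, pred in records if gold == pred)
    return correct / len(records)


def macro_accuracy(records):
    """Unweighted mean of per-criterion accuracies."""
    counts = defaultdict(lambda: [0, 0])  # criterion -> [correct, total]
    for criterion, gold, pred in records:
        counts[criterion][0] += int(gold == pred)
        counts[criterion][1] += 1
    return sum(c / n for c, n in counts.values()) / len(counts)


# Hypothetical toy judgments: True means "criterion correctly reported".
toy = [
    ("eligibility", True, True),
    ("eligibility", False, False),
    ("eligibility", True, True),
    ("randomization", True, False),
]

print(micro_accuracy(toy))  # 3 of 4 pooled judgments are correct
print(macro_accuracy(toy))  # mean of 3/3 and 0/1
```

With micro-averaging, criteria that have more annotated abstracts weigh more heavily in the final score, whereas a macro average treats every criterion equally regardless of how many examples it has.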

Paper Structure

This paper contains 33 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Example of generations for our "Few-Shot Chain-of-Thought" strategy with 1, 3, or 5 examples using the Mixtral-8x22B-Instruct model for an abstract of the CONSORT-QA-Depression corpus annotated by experts as correctly reported for the eligibility mention criterion
  • Figure 7: Pearson correlation between model performance by criterion and the difficulty of these criteria
  • Figure 9: Qualitative error analysis for explanations of the Mixtral-8x22B model
  • Figure : Here, article abstracts have been shortened for space reasons, and we provide only one example (but for few-shot strategies, we provide several).
  • Figure : X-axis: model name and size | Blue bars: general models | Green bars: biomedical models
  • ...and 5 more figures