Can artificial intelligence predict clinical trial outcomes?

Shuyi Jin, Lu Chen, Hongru Ding, Meijie Wang, Lun Yu

TL;DR

This study evaluates whether large language models (LLMs) and the HINT model can predict clinical trial outcomes, using a ClinicalTrials.gov–derived dataset that spans trial phases, disease categories, endpoints, and durations. GPT-4o performs best overall among the LLMs but struggles to identify negative outcomes, while HINT recognizes negative samples reliably and is resilient to external factors such as recruitment challenges, although it underperforms in oncology trials. The strengths are complementary: LLMs do well in early-phase trials and on simpler endpoints (e.g., Overall Survival, OS), whereas HINT delivers consistent performance across phases and higher specificity, which suggests that hybrid approaches could yield more robust forecasts. The work has practical implications for risk prediction in drug development and underscores the need to account for terminated trials and external factors when deploying AI-based forecasting. Overall, the findings support combining the breadth of LLMs with HINT's specificity to improve clinical trial outcome forecasting in real-world settings.

Abstract

This study evaluates the performance of large language models (LLMs) and the HINT model in predicting clinical trial outcomes, focusing on metrics including Balanced Accuracy, Matthews Correlation Coefficient (MCC), Recall, and Specificity. Results show that GPT-4o achieves superior overall performance among LLMs but, like its counterparts (GPT-3.5, GPT-4mini, Llama3), struggles with identifying negative outcomes. In contrast, HINT excels in negative sample recognition and demonstrates resilience to external factors (e.g., recruitment challenges) but underperforms in oncology trials, a major component of the dataset. LLMs exhibit strengths in early-phase trials and simpler endpoints like Overall Survival (OS), while HINT shows consistency across trial phases and excels in complex endpoints (e.g., Objective Response Rate). Trial duration analysis reveals improved model performance for medium- to long-term trials, with GPT-4o and HINT displaying stability and enhanced specificity, respectively. We underscore the complementary potential of LLMs (e.g., GPT-4o, Llama3) and HINT, advocating for hybrid approaches to leverage GPT-4o's predictive power and HINT's specificity in clinical trial outcome forecasting.
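To make the four metrics named above concrete, the following is a minimal sketch of how they can be computed from binary predictions. It uses the standard textbook definitions and is not taken from the paper's code; the function name `binary_metrics`, the label encoding (1 = success, 0 = failure), and the toy labels are assumptions chosen for illustration only.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute Recall, Specificity, Balanced Accuracy, and MCC.

    Assumes labels are encoded as 1 (trial success) and 0 (trial failure).
    This encoding is an illustrative assumption, not the paper's specification.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Recall: share of actual successes that were predicted as successes.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual failures that were predicted as failures.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Balanced Accuracy: mean of recall and specificity.
    balanced_accuracy = (recall + specificity) / 2

    # Matthews Correlation Coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Toy data (made up): a predictor that labels almost every trial a success.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]
example = binary_metrics(y_true, y_pred)
print(example)
```

In this toy case the predictor finds every success (recall of 1.0) but only one of four failures (specificity of 0.25). That is the pattern the abstract describes for models that struggle with negative outcomes, and it is why Balanced Accuracy and MCC are more informative here than recall alone.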

Paper Structure

This paper contains 23 sections, 2 figures, 8 tables.

Figures (2)

  • Figure 1: A: Distribution of diseases across the different model datasets. B: Day difference distribution by difference classification. C: Distribution of clinical trial outcomes across the different model datasets.
  • Figure 2: A: Line chart of the cumulative amount of clinical data covered by each model up to its cut-off date; the horizontal axis shows the models' time points and the vertical axis shows the cumulative data volume. B: Data distribution across clinical trial phases and trial statuses. C: Different labels in different study states: This figure describes the input data structures and output results for different models predicting clinical trial outcomes. The upper part shows the input features used by the models. Based on these features, the models predict whether a trial is a "Success" or a "Failure". E: The input features used by the HINT model. The HINT model determines the final label by predicting the probability of success. F: Word cloud of the reasons given for clinical trial termination.