Can artificial intelligence predict clinical trial outcomes?
Shuyi Jin, Lu Chen, Hongru Ding, Meijie Wang, Lun Yu
TL;DR
This study evaluates whether large language models (LLMs) and the HINT model can predict clinical trial outcomes using a ClinicalTrials.gov–derived dataset across phases, disease categories, endpoints, and durations. It finds that GPT-4o provides the strongest overall performance among LLMs but struggles with identifying negative outcomes, while HINT offers robust negative-sample recognition and resilience to recruitment noise, albeit with limitations in oncology trials. The results reveal complementary strengths: LLMs excel in early-phase trials and on simpler endpoints such as overall survival (OS), whereas HINT delivers consistent performance across phases and higher specificity, suggesting that hybrid approaches could yield more robust trial-outcome forecasts. The work highlights practical implications for risk prediction in drug development and underscores the need to account for terminated trials and external factors when deploying AI-based forecasting. Overall, the findings support integrating LLM breadth with HINT’s specificity to improve clinical trial outcome forecasting in real-world settings.
Abstract
This study evaluates the performance of large language models (LLMs) and the HINT model in predicting clinical trial outcomes, focusing on metrics including Balanced Accuracy, Matthews Correlation Coefficient (MCC), Recall, and Specificity. Results show that GPT-4o achieves the best overall performance among the LLMs but, like its counterparts (GPT-3.5, GPT-4mini, Llama3), struggles to identify negative outcomes. In contrast, HINT excels in negative-sample recognition and demonstrates resilience to external factors (e.g., recruitment challenges) but underperforms in oncology trials, a major component of the dataset. LLMs exhibit strengths in early-phase trials and simpler endpoints like Overall Survival (OS), while HINT shows consistency across trial phases and excels in complex endpoints (e.g., Objective Response Rate). Trial duration analysis reveals improved model performance for medium- to long-term trials, with GPT-4o and HINT displaying stability and enhanced specificity, respectively. We underscore the complementary potential of LLMs (e.g., GPT-4o, Llama3) and HINT, advocating for hybrid approaches to leverage GPT-4o's predictive power and HINT's specificity in clinical trial outcome forecasting.
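To make the four reported metrics concrete, the following is a minimal sketch of how they can be computed from binary trial-outcome labels, using the standard textbook definitions. The function name `trial_metrics`, the label convention (1 = trial succeeded, 0 = trial failed), and the toy labels are illustrative assumptions and are not taken from the study's code or data. The toy predictions mimic a model that rarely predicts failure, which is the pattern the abstract describes for the LLMs: recall stays high while specificity, and with it Balanced Accuracy and MCC, drops.

```python
import math


def trial_metrics(y_true, y_pred):
    """Compute Recall, Specificity, Balanced Accuracy, and MCC.

    Assumed convention (not from the paper): 1 = success, 0 = failure.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Recall: share of truly successful trials predicted as successful.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of truly failed trials predicted as failed.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Balanced Accuracy: mean of recall and specificity.
    balanced_accuracy = (recall + specificity) / 2

    # MCC: correlation between predictions and labels, in [-1, 1].
    # Defined as 0 here when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Toy labels (hypothetical): 4 successful and 4 failed trials.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# A predictor biased toward "success": it flags only one failure.
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]

metrics = trial_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this toy input, recall is 1.0 while specificity is only 0.25, so a model can look strong on positive outcomes and still miss most failures. This is why the abstract reports Specificity and MCC alongside Recall instead of relying on a single score.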
