Cost-Aware Prediction (CAP): An LLM-Enhanced Machine Learning Pipeline and Decision Support System for Heart Failure Mortality Prediction
Yinan Yu, Falk Dippel, Christina E. Lundberg, Martin Lindgren, Annika Rosengren, Martin Adiels, Helen Sjöland
TL;DR
This work addresses the gap between predictive accuracy and downstream clinical value in heart failure mortality prediction by proposing the Cost-Aware Prediction (CAP) framework, which integrates an ML classifier with clinical impact projection (CIP) cost curves and a four-agent, large language model (LLM)-driven cost-benefit analysis to support decision-making. The best-performing model, a gradient-boosting classifier, achieves AUROC $=0.804$ and AUPRC $=0.529$, while the CIP curves show how different decision thresholds affect patient quality of life and healthcare expenditures. The novel contribution lies in combining population-level cost visualization with patient-level, LLM-generated interpretations to make trade-offs explicit and to improve interpretability and trust. The study demonstrates that CAP's three-stage pipeline enables more transparent, cost-aware decision support for home-care eligibility in heart failure, with the potential to inform policy, although the LLM agents' speculative outputs need more robust handling.
Abstract
Objective: Machine learning (ML) predictive models are often developed without considering downstream value trade-offs and clinical interpretability. This paper introduces a cost-aware prediction (CAP) framework that uses cost-benefit analysis, assisted by large language model (LLM) agents, to communicate the trade-offs involved in applying ML predictions. Materials and Methods: We developed an ML model predicting 1-year mortality in patients with heart failure (N = 30,021, 22% mortality) to identify those eligible for home care. We then introduced clinical impact projection (CIP) curves to visualize important cost dimensions (quality of life and healthcare provider expenses, further divided into treatment and error costs) and thereby assess the clinical consequences of predictions. Finally, we used four LLM agents to generate patient-specific descriptions. The system was evaluated by clinicians for its decision support value. Results: The eXtreme gradient boosting (XGB) model achieved the best performance, with an area under the receiver operating characteristic curve (AUROC) of 0.804 (95% confidence interval (CI) 0.792-0.816), an area under the precision-recall curve (AUPRC) of 0.529 (95% CI 0.502-0.558), and a Brier score of 0.135 (95% CI 0.130-0.140). Discussion: The CIP cost curves provided a population-level overview of cost composition across decision thresholds, whereas the LLM agents generated cost-benefit analyses at the individual patient level. The system was well received in the clinicians' evaluation. However, their feedback emphasized the need to strengthen technical accuracy on speculative tasks. Conclusion: CAP utilizes LLM agents to integrate ML classifier outcomes and cost-benefit analysis for more transparent and interpretable decision support.
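
As a rough illustration of the idea behind a cost curve over decision thresholds, the following minimal Python sketch sweeps a threshold over predicted risks and splits the resulting cost into a treatment part and an error part. It is not the authors' CIP implementation: the names `UnitCosts`, `cost_curve`, and `brier_score`, the single cost dimension, and all unit costs are hypothetical assumptions chosen for illustration, and the toy labels and probabilities are not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCosts:
    # Hypothetical per-patient costs in arbitrary units (NOT values from the paper).
    treat_positive: float = 1.0   # cost of acting on any positive prediction
    false_positive: float = 2.0   # error cost: flagged as high risk, but survived
    false_negative: float = 5.0   # error cost: not flagged, but died


def cost_curve(y_true, y_prob, thresholds, costs=UnitCosts()):
    """Return one row per threshold with treatment, error, and total cost."""
    rows = []
    for t in thresholds:
        tp = fp = fn = 0
        for y, p in zip(y_true, y_prob):
            flagged = p >= t  # positive prediction at this threshold
            if flagged and y == 1:
                tp += 1
            elif flagged and y == 0:
                fp += 1
            elif not flagged and y == 1:
                fn += 1
        treatment = costs.treat_positive * (tp + fp)
        error = costs.false_positive * fp + costs.false_negative * fn
        rows.append({
            "threshold": t,
            "treatment": treatment,
            "error": error,
            "total": treatment + error,
        })
    return rows


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


if __name__ == "__main__":
    # Toy data for illustration only.
    y_true = [1, 0, 0, 1, 0]
    y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
    for row in cost_curve(y_true, y_prob, [0.3, 0.5, 0.7]):
        print(row)
    print("Brier score:", brier_score(y_true, y_prob))
```

Plotting the `treatment` and `error` columns against `threshold` gives a simple one-dimensional analogue of a cost-composition curve. The framework described above tracks quality of life and healthcare provider expenses as separate dimensions, which this sketch does not attempt to reproduce.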
