Scaling Clinician-Grade Feature Generation from Clinical Notes with Multi-Agent Language Models

Jiayi Wang, Jacqueline Jil Vallon, Nikhil V. Kotha, Neil Panjwani, Xi Ling, Margaret Redfield, Sushmita Vij, Sandy Srinivas, John Leppert, Mark K. Buyyounouski, Mohsen Bayati

TL;DR

This study tackles the bottleneck of converting unstructured EHR notes into structured features for clinical prediction. It introduces SNOW, a modular multi-agent LLM workflow that replicates expert Clinician Feature Generation (CFG) at scale while preserving interpretability through auditable intermediate artifacts. In a prostate cancer cohort (n=147) with five-year recurrence as the endpoint, SNOW performs on par with manual CFG and surpasses Representational Feature Generation (RFG) baselines, while reducing human expert effort by approximately 48-fold. External validation on a heart failure with preserved ejection fraction (HFpEF) cohort (n=2,084) using discharge summaries supports SNOW's generalizability: Baseline + SNOW delivers the strongest 30-day and 1-year mortality predictions, which suggests practical potential for cross-domain deployment and reproducible clinical research.

Abstract

Developing accurate clinical prediction models is often bottlenecked by the difficulty of deriving meaningful structured features from unstructured EHR notes, a process that traditionally requires manual, unscalable clinical abstraction. In this study, we first established a rigorous patient-level Clinician Feature Generation (CFG) protocol, in which domain experts manually reviewed notes to define and extract nuanced features for a cohort of 147 patients with prostate cancer. As a high-fidelity ground truth, this labor-intensive process provided the blueprint for SNOW (Scalable Note-to-Outcome Workflow), a transparent multi-agent large language model (LLM) system designed to autonomously mimic the iterative reasoning and validation workflow of clinical experts. On 5-year cancer recurrence prediction, SNOW (AUC-ROC 0.767) achieved performance comparable to manual CFG (0.762) and outperformed structured baselines, clinician-guided LLM extraction, and six representational feature generation (RFG) approaches. Once configured, SNOW produced the full patient-level feature table in 12 hours with 5 hours of clinician oversight, reducing human expert effort by approximately 48-fold versus manual CFG. To test scalability where manual CFG is infeasible, we deployed SNOW on an external heart failure with preserved ejection fraction (HFpEF) cohort from MIMIC-IV (n=2,084); without task-specific tuning, SNOW generated prognostic features that outperformed baseline and RFG methods for 30-day (SNOW: 0.851) and 1-year (SNOW: 0.763) mortality prediction. These results demonstrate that a modular LLM agent-based system can scale expert-level feature generation from clinical notes, while enabling interpretable use of unstructured EHR text in outcome prediction and preserving generalizability across a variety of settings and conditions.
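The results above are reported as AUC-ROC. As a reminder of what that number measures, here is a minimal sketch in pure Python that computes AUC-ROC as the fraction of (positive, negative) patient pairs in which the positive case receives the higher risk score, with ties counted as half. The function name `auc_roc` and the toy labels and scores are illustrative assumptions and are not data from the paper.

```python
from itertools import product


def auc_roc(labels, scores):
    """AUC-ROC via pairwise ranking: P(score of a positive > score of a negative).

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not from the paper):
# 1 = event observed (e.g. recurrence), 0 = no event.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.4]
toy_auc = auc_roc(labels, scores)  # 4 of 6 pairs are ranked correctly
```

Under this reading, an AUC-ROC of 0.767 means that a randomly chosen patient with the event is ranked above a randomly chosen patient without it about 77% of the time.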

Paper Structure

This paper contains 43 sections, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Systematic framework for scalable, expert-level clinical feature generation. (a) Overview of the clinical prediction pipeline highlighting the bottleneck of unstructured text (left); summary of the dual-cohort evaluation strategy (right): the primary cohort (N=147) serves as a high-fidelity 'calibration testbed' to benchmark agentic reasoning against a high-resolution human ground truth, while the external cohort (N=2,084) serves as a 'deployment testbed' to validate scalability in a real-world setting where manual ground truth generation is intractable. (b) Comparison of the manual Clinician Feature Generation (CFG) protocol versus the SNOW system.
  • Figure 2: Systematic workflow for patient-level clinician feature generation (CFG). Stepwise pipeline used by oncologists and data scientists to translate domain expertise into structured features, including defining clinical concepts based on expertise (left), defining associated features by reviewing patient note samples (center), and manually extracting features requiring per-patient parsing and clinical input (right) to create the final curated dataset.
  • Figure 3: Obstacles in extraction of patient-level CFG. Representative excerpts from biopsy and pathology notes demonstrating challenges for rule-based extraction, such as varying region names, inconsistent reporting of results per core, diverse formatting, and irregular negation phrasing. These examples motivate the need for flexible, context-aware methods such as SNOW to reliably recover clinically meaningful features from unstructured text.
  • Figure 4: Architecture of SNOW, a modular multi-agent Scalable Note-to-Outcome Workflow. SNOW decomposes feature generation into specialized LLM agents: the Proposal and Alignment Agents for feature definition, and the Extraction and Validation Agents operating in a loop to refine extraction logic. The Aggregation Code Generator compiles aggregated features from extracted values. Arrows indicate information flow between agents. The system allows for human oversight at each stage, enabling experts to review, refine, or override intermediate outputs.
  • Figure 5: Performance of feature generation methods for 5-year prostate cancer recurrence prediction. Distributions of area under the receiver operating characteristic curve (AUC-ROC) across 50 repetitions of nested cross-validation for regularized logistic regression, $k$-nearest neighbors, and random feature models trained with different feature sets: Baseline only, Baseline + Bag-of-Words (BoW) TF-IDF, Baseline + BoW TF-IDF (LLM-informed), Baseline + CLFG, Baseline + SNOW, and Baseline + CFG. CFG substantially improves performance relative to Baseline, and SNOW achieves AUC-ROC comparable to CFG while outperforming all RFG approaches, indicating that task-adapted agentic feature generation better harvests prognostic signal from notes than off-the-shelf embeddings. Among RFG methods, BoW TF-IDF is the best-performing non-LLM-informed variant and BoW TF-IDF (LLM-informed) is the best-performing LLM-informed variant; a full comparison of all RFG methods is provided in the appendix on RFG comparison.
  • ...and 13 more figures
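
The Figure 4 caption describes Extraction and Validation Agents operating in a loop to refine extraction logic, followed by an Aggregation Code Generator that compiles patient-level features. The control flow can be sketched roughly as below. This is a toy illustration under heavy assumptions, not the authors' implementation: the LLM agents are replaced by regular-expression stand-ins, "refinement" is a fixed ladder of hypothetical patterns, the Proposal and Alignment Agents are omitted (a single feature is hard-coded), and every name, note, and value is invented for illustration.

```python
import re

# Hypothetical refinement ladder: each retry uses a more permissive pattern.
# In the described system an LLM would revise the extraction logic instead.
PATTERNS = [
    r"gleason score[:\s]+(\d+)\b(?!\s*\+)",   # e.g. "Gleason score: 7"
    r"gleason[^\d]{0,20}(\d)\s*\+\s*(\d)",    # e.g. "Gleason 4 + 5"
]


def extraction_agent(note, pattern):
    """Stand-in extractor: return the summed captured digits, or None."""
    match = re.search(pattern, note, flags=re.IGNORECASE)
    if match is None:
        return None
    return sum(int(group) for group in match.groups())


def validation_agent(value):
    """Stand-in validator: accept only totals in the plausible 2..10 range."""
    return value is not None and 2 <= value <= 10


def extract_with_validation(note, patterns):
    """Try each extraction strategy until the validator accepts a value."""
    for attempt, pattern in enumerate(patterns, start=1):
        value = extraction_agent(note, pattern)
        if validation_agent(value):
            return {"value": value, "attempt": attempt, "pattern": pattern}
    return {"value": None, "attempt": len(patterns), "pattern": None}


def aggregate(records):
    """Stand-in aggregation: maximum validated value across a patient's notes."""
    values = [r["value"] for r in records if r["value"] is not None]
    return max(values) if values else None


# Invented notes for two hypothetical patients.
notes_by_patient = {
    "A": [
        "Biopsy: Gleason score: 7 in right apex.",
        "Pathology shows Gleason 4 + 5 disease.",
    ],
    "B": ["No mention of grade."],
}

# Per-note records are kept as an auditable intermediate artifact.
audit_log = {
    pid: [extract_with_validation(note, PATTERNS) for note in notes]
    for pid, notes in notes_by_patient.items()
}
feature_table = {
    pid: {"max_gleason_total": aggregate(records)}
    for pid, records in audit_log.items()
}
```

Keeping the per-note `audit_log` separate from the final `feature_table` mirrors the caption's point that experts can review, refine, or override intermediate outputs at each stage.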