Automating Early Disease Prediction Via Structured and Unstructured Clinical Data

Ane G Domingo-Aldama, Marcos Merino Prado, Alain García Olea, Josu Goikoetxea, Koldo Gojenola, Aitziber Atutxa

Abstract

This study presents a fully automated methodology for early prediction studies in clinical settings, leveraging information extracted from unstructured discharge reports. The proposed pipeline uses discharge reports to support the three main steps of early prediction: cohort selection, dataset generation, and outcome labeling. By processing discharge reports with natural language processing techniques, we can efficiently identify relevant patient cohorts, enrich structured datasets with additional clinical variables, and generate high-quality labels without manual intervention. This approach addresses the frequent issue of missing or incomplete data in codified electronic health records (EHRs), capturing clinically relevant information that is often underrepresented. We evaluate the methodology in the context of predicting atrial fibrillation (AF) progression, showing that predictive models trained on datasets enriched with discharge report information achieve higher accuracy and correlation with true outcomes than models trained solely on structured EHR data, while also surpassing traditional clinical scores. These results demonstrate that automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.

Paper Structure

This paper contains 15 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Clinical scenario of AF progression.
  • Figure 4: End-to-end overview of the proposed methodology. The pipeline starts with automatic cohort selection, followed by dataset generation, which combines structured EHR data with information extracted from clinical reports. An NLP module performs automatic labeling, and the resulting silver and gold datasets are used to train and evaluate a TabPFN model for AF progression prediction.
  • Figure 5: Overview of the vector generation process. For each patient in the AF onset cohort, all discharge reports (free text) and codified data (structured data stored in the Business Intelligence system) are collected and processed using the Report2Vector (R2V) and Structured2Vector (S2V) tools, respectively. Each tool generates a corresponding set of vectors, which are then merged by the VectorMerger (VM) tool to produce a patient-specific vector that integrates both sources of clinical information.
  • Figure 6: Amount of missing values. Difference in percentage between the original and enriched datasets. The bar plots illustrate the recovery of features that were absent in the original codified dataset but retrieved from the information contained in the discharge reports.
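
The merging step in Figure 5 and the missing-value comparison in Figure 6 can be illustrated with a minimal sketch. It is not the authors' Report2Vector, Structured2Vector, or VectorMerger code, whose interfaces are not described here. The function names, feature names, and patient values below are hypothetical, and the rule that a structured value takes priority over a report-derived one is an assumption made for illustration.

```python
from typing import Dict, List, Optional

# A patient vector maps a feature name to a value; None marks a missing value.
Vector = Dict[str, Optional[float]]


def merge_vectors(structured: Vector, from_reports: Vector) -> Vector:
    """Combine two vectors for one patient.

    Assumption: a structured (codified) value is kept when present, and the
    report-derived value is used only to fill a gap.
    """
    merged = dict(structured)
    for name, value in from_reports.items():
        if merged.get(name) is None:
            merged[name] = value
    return merged


def missing_percentage(rows: List[Vector], features: List[str]) -> Dict[str, float]:
    """Percentage of patients with a missing value, per feature."""
    total = len(rows)
    return {
        f: 100.0 * sum(1 for r in rows if r.get(f) is None) / total
        for f in features
    }


# Hypothetical toy data: two patients, three features.
features = ["age", "lvef", "hypertension"]
structured_data = {
    "p1": {"age": 71.0, "lvef": None, "hypertension": None},
    "p2": {"age": 64.0, "lvef": 55.0, "hypertension": None},
}
report_data = {
    "p1": {"lvef": 48.0, "hypertension": 1.0},
    "p2": {"hypertension": None},  # nothing recoverable from this report
}

enriched_data = {
    pid: merge_vectors(vec, report_data.get(pid, {}))
    for pid, vec in structured_data.items()
}

original_missing = missing_percentage(list(structured_data.values()), features)
enriched_missing = missing_percentage(list(enriched_data.values()), features)

# Percentage-point reduction in missing values per feature.
recovered = {f: original_missing[f] - enriched_missing[f] for f in features}
print(recovered)
```

In this toy case the report-derived values fill every gap in one feature and half the gaps in another, which is the kind of per-feature difference the bar plots in Figure 6 summarize.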