Patient-level Information Extraction by Consistent Integration of Textual and Tabular Evidence with Bayesian Networks

Paloma Rabaey, Adrick Tench, Stefan Heytens, Thomas Demeester

TL;DR

This work tackles patient-level information extraction from electronic health records by combining an expert-defined Bayesian network (BN) over structured tabular data with neural classifiers that interpret unstructured clinical notes. The core innovation is the consistency node, which, together with a virtual evidence mechanism, probabilistically fuses the predictions of the BN and the text classifiers, yielding better-calibrated, interpretable outputs that are more robust to missing or shifted text information. Evaluated on the SimSUM dataset, the V-C-BN-text model consistently outperforms uni-modal and simple fusion baselines, particularly when the text is incomplete or misleading, and it retains its advantage under distribution shift. The approach offers a flexible, interpretable framework for multi-modal information extraction that could extend to other modalities and broader clinical tasks; code is available.

Abstract

Electronic health records (EHRs) form an invaluable resource for training clinical decision support systems. To leverage the potential of such systems in high-risk applications, we need large, structured tabular datasets on which we can build transparent feature-based models. While part of the EHR already contains structured information (e.g. diagnosis codes, medications, and lab results), much of the information is contained within unstructured text (e.g. discharge summaries and nursing notes). In this work, we propose a method for multi-modal patient-level information extraction that leverages both the tabular features available in the patient's EHR (using an expert-informed Bayesian network) as well as clinical notes describing the patient's symptoms (using neural text classifiers). We propose the use of virtual evidence augmented with a consistency node to provide an interpretable, probabilistic fusion of the models' predictions. The consistency node improves the calibration of the final predictions compared to virtual evidence alone, allowing the Bayesian network to better adjust the neural classifier's output to handle missing information and resolve contradictions between the tabular and text data. We show the potential of our method on the SimSUM dataset, a simulated benchmark linking tabular EHRs with clinical notes through expert knowledge.

Paper Structure

This paper contains 37 sections, 11 equations, 2 figures, 20 tables.

Figures (2)

  • Figure 1: Overview of our patient-level information extraction method which integrates both tabular and text evidence. As an example, we show how to extract the probability that a patient suffers from Dyspnea, given tabular evidence that is already encoded in the EHR, a clinical note describing the patient's symptoms, and an expert-defined Bayesian network (BN) structure. On the right, the neural classifiers infer probabilities that the text mentions each symptom, with a 63% confidence for Dyspnea (in this case, "dyspnea" is not mentioned verbatim). The classifiers' probabilities for each symptom are provided as virtual evidence to the BN, via the red "virtual" nodes in the network. Given all tabular and virtual evidence, the BN infers that the patient has a 78% chance of Dyspnea -- since the patient has both Pneumonia and Asthma, this probability is high. The consistency node $VC_{\texttt{Dyspnea}}$ combines these probabilities, arriving at an 89% chance that this patient has Dyspnea. Part of this figure is adapted from SimSUM.
  • Figure 2: Comparison of our work with AIME_paloma. In this running example, we aim to predict the probability $\mathcal{P}(\texttt{Dysp} \mid \texttt{Pneu}, note)$ that a patient suffers from the symptom dyspnea (Dysp), given both tabular information (whether the patient has pneumonia, or Pneu, and was prescribed antibiotics, or Antibio) and a clinical note ($note$). In the generative and discriminative BN-text models proposed by AIME_paloma (A and B), a $note$ node is directly integrated in the BN, allowing one to perform Bayesian inference with mixed textual and tabular evidence. In our method, we separate the BN, where Bayesian inference is performed only over tabular evidence, from the neural text classifier, and integrate their predictions through the consistency node $C_{\texttt{Dysp}}$ (C) and virtual evidence (D). Our method improves on the poor performance of A and on the poor scalability, interpretability, and causal structure of B.
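
The captions above describe passing a text classifier's probability to the BN as virtual evidence. A minimal sketch of that idea follows, assuming Pearl-style virtual evidence on a single binary symptom and treating the classifier output q as a likelihood ratio q : (1 - q). The CPT values, the classifier probability, and all function names are hypothetical and are not taken from the paper or from SimSUM. The consistency node itself is not modeled here, so this sketch is not expected to reproduce the percentages quoted in Figure 1.

```python
# Toy illustration only: all numbers and names below are hypothetical.

# P(Asthma = yes); Asthma is left unobserved in this toy example.
P_ASTHMA = 0.1

# P(Dysp = yes | Pneu = yes, Asthma = a), keyed by the value of Asthma.
P_DYSP_GIVEN_PNEU = {True: 0.9, False: 0.6}


def bn_posterior_dysp() -> float:
    """P(Dysp = yes | Pneu = yes), marginalizing over the unobserved Asthma."""
    return (P_ASTHMA * P_DYSP_GIVEN_PNEU[True]
            + (1.0 - P_ASTHMA) * P_DYSP_GIVEN_PNEU[False])


def apply_virtual_evidence(p_yes: float, q_yes: float) -> float:
    """Update a binary belief p_yes with virtual evidence.

    Assumption: the classifier probability q_yes is used as the likelihood
    ratio q_yes : (1 - q_yes) in favor of the symptom being present.
    """
    weight_yes = p_yes * q_yes
    weight_no = (1.0 - p_yes) * (1.0 - q_yes)
    return weight_yes / (weight_yes + weight_no)


if __name__ == "__main__":
    bn = bn_posterior_dysp()                 # tabular evidence only
    fused = apply_virtual_evidence(bn, 0.8)  # hypothetical classifier output
    print(f"{bn:.3f} {fused:.3f}")           # prints "0.630 0.872"
```

Under this assumption, a classifier output of 0.5 carries no information and leaves the BN posterior unchanged, while outputs above 0.5 raise it and outputs below 0.5 lower it.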