SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records

Paloma Rabaey, Stefan Heytens, Thomas Demeester

TL;DR

SimSUM introduces a fully synthetic benchmark that links structured tabular background data with unstructured clinical notes in a respiratory domain via an expert-defined Bayesian network. The dataset comprises 10,000 records with BN-generated tabular features and GPT-4o-generated notes, annotated with symptom spans for information extraction research. The paper demonstrates that incorporating background tabular information improves symptom extraction, especially for harder-to-predict symptoms, and provides extensive baseline models across tabular, textual, and multimodal inputs. It further discusses expert evaluation, span-based analysis, and multiple intended uses including multimodal CIE research, causal inference with textual confounders, and synthetic data benchmarking, while explicitly cautioning against production deployment.

Abstract

Clinical information extraction, which involves structuring clinical concepts from unstructured medical text, remains a challenging problem that could benefit from the inclusion of tabular background information available in electronic health records. Existing open-source datasets lack explicit links between structured features and clinical concepts in the text, motivating the need for a new research dataset. We introduce SimSUM, a benchmark dataset of 10,000 simulated patient records that link unstructured clinical notes with structured background variables. Each record simulates a patient encounter in the domain of respiratory diseases and includes tabular data (e.g., symptoms, diagnoses, underlying conditions) generated from a Bayesian network whose structure and parameters are defined by domain experts. A large language model (GPT-4o) is prompted to generate a clinical note describing the encounter, including symptoms and relevant context. These notes are annotated with span-level symptom mentions. We conduct an expert evaluation to assess note quality and run baseline predictive models on both the tabular and textual data. The SimSUM dataset is primarily designed to support research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text -- namely, symptoms in the case of SimSUM. Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. SimSUM is not intended for training clinical decision support systems or production-grade models, but rather to facilitate reproducible research in a simplified and controlled setting.
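The two-stage generation process described above (sample a tabular record from a Bayesian network, then turn part of it into a prompt for a language model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the network `TOY_BN`, its probabilities, the prompt wording, and the helper names `sample_record` and `build_prompt` are all hypothetical, and the call to the language model is omitted. In line with the Figure 3 caption, the prompt lists symptoms and underlying conditions but no diagnoses.

```python
import random

# Toy network, illustrative only: the variables and probabilities are
# placeholders, NOT the expert-defined SimSUM Bayesian network.
# Each node maps to (parents, CPT); the CPT gives P(node = True) for every
# tuple of parent values. Nodes are listed in topological order.
TOY_BN = {
    "asthma": ((), {(): 0.10}),
    "pneumonia": ((), {(): 0.05}),
    "dyspnea": (("asthma", "pneumonia"), {
        (False, False): 0.05, (False, True): 0.60,
        (True, False): 0.70, (True, True): 0.90,
    }),
    "fever": (("pneumonia",), {(False,): 0.05, (True,): 0.80}),
}

BACKGROUND = ("asthma",)          # underlying conditions, shown in the prompt
SYMPTOMS = ("dyspnea", "fever")   # symptoms, shown in the prompt
# The diagnosis (here: pneumonia) is deliberately left out of the prompt.


def sample_record(rng):
    """Ancestral sampling: draw each node given its already-sampled parents."""
    record = {}
    for node, (parents, cpt) in TOY_BN.items():
        parent_values = tuple(record[p] for p in parents)
        record[node] = rng.random() < cpt[parent_values]
    return record


def build_prompt(record):
    """Describe symptoms and background, but no diagnoses, as plain text."""
    lines = ["Write a clinical note for the following patient encounter."]
    for name in BACKGROUND + SYMPTOMS:
        lines.append(f"- {name}: {'yes' if record[name] else 'no'}")
    return "\n".join(lines)


rng = random.Random(0)            # fixed seed for reproducibility
record = sample_record(rng)       # tabular portion of one simulated patient
prompt = build_prompt(record)     # text that would be sent to the language model
```

Because the sampled record keeps the diagnosis while the prompt hides it, the tabular data and the generated note stay linked only through the symptoms and background variables, which is the property the benchmark relies on.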

Paper Structure (42 sections, 9 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: General overview of the structure of this paper. We first describe the construction of SimSUM, a simulated dataset combining structured tabular data and unstructured clinical text (Section \ref{sec:methodology}). We then evaluate the clinical notes through both a clinical expert review (Section \ref{sec:expert_evaluation}) and automated span-level symptom extraction (Sections \ref{sec:span_extraction} and \ref{sec:utility_spans}). Finally, we present four baseline predictive models that take as input the tabular data, the textual data, or both (Section \ref{sec:CIE_baselines}).
  • Figure 2: We have a clinical description of a patient encounter from which we want to extract some concepts, in this case the symptoms experienced by the patient. Some symptoms are easy to extract using text matching, like "high fever". Other symptoms, like dyspnea, are not mentioned verbatim and are therefore harder to extract. In this case, additional information on the patient, present in encoded format in the tabular portion of the EHR, may help when combined with domain knowledge. We illustrate this idea with two examples. Example 1: We know that the patient has asthma. Domain knowledge may tell us that the probability of experiencing dyspnea when one has asthma (Equation (1)) is 90%, thereby increasing the prior probability of encountering dyspnea in the text. By integrating this knowledge, the information extraction module can more accurately predict the posterior probability of encountering dyspnea in the text. Example 2: We know that the patient is experiencing high fever. Domain knowledge may tell us that a high fever often co-occurs with dyspnea due to their common cause, which is pneumonia. Even if we do not know that the patient has pneumonia, the probability of dyspnea being mentioned in the text increases as a result of observing high fever. By modeling the joint probability of dyspnea, fever and pneumonia using a Bayesian network, we can compute the exact probability $\mathcal{P}(\text{dyspnea = yes} \mid \text{fever = high})$ by summing over the possible presence and absence of pneumonia, a procedure called Bayesian inference (Equation (2); Koller and Friedman, 2009).
  • Figure 3: Overview of the full data generating process for the SimSUM dataset. First, the tabular portion of the artificial patient record is sampled from a Bayesian network, where both the structure and the conditional probability distributions were defined by an expert. Next, we construct a prompt describing the symptoms experienced by the patient, as well as their underlying health conditions (but no diagnoses), and ask the GPT-4o large language model to generate a clinical note describing this patient encounter. Finally, we ask the model to generate a more challenging, compact version of the note, which mimics the complexity of real clinical notes by prompting the use of abbreviations and shortcuts. We generate $10{,}000$ of these artificial patient records in total.
  • Figure 4: Conditional probability tables for the variables asthma, smoking, hay fever, COPD, season, pneumonia, common cold, fever, policy and self-employed.
  • Figure 5: Prompting strategy for extracting symptom spans from (a) normal and (b) compact clinical notes using a large language model.
  • ...and 6 more figures
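
Example 2 in the Figure 2 caption can be made concrete with a small sketch. The snippet below computes P(dyspnea = yes | fever = high) by summing over the presence and absence of pneumonia, in a three-node network where pneumonia is the common cause of fever and dyspnea. All probabilities are made-up illustrative values, not the expert-defined parameters of SimSUM, and the function names are hypothetical. The sketch also assumes that dyspnea and fever are conditionally independent given pneumonia, which holds in this toy structure but not necessarily in the full SimSUM network.

```python
# Illustrative numbers only: these are NOT the SimSUM conditional
# probability tables.
P_PNEUMONIA = {True: 0.05, False: 0.95}   # P(pneumonia)
P_HIGH_FEVER = {True: 0.60, False: 0.02}  # P(fever = high | pneumonia)
P_DYSPNEA = {True: 0.70, False: 0.10}     # P(dyspnea = yes | pneumonia)


def posterior_pneumonia_given_high_fever():
    """P(pneumonia | fever = high), obtained with Bayes' rule."""
    joint = {p: P_PNEUMONIA[p] * P_HIGH_FEVER[p] for p in (True, False)}
    evidence = sum(joint.values())        # P(fever = high)
    return {p: joint[p] / evidence for p in joint}


def p_dyspnea_given_high_fever():
    """P(dyspnea = yes | fever = high), marginalizing over pneumonia.

    Uses the assumption that dyspnea is independent of fever once
    pneumonia is known (common-cause structure).
    """
    posterior = posterior_pneumonia_given_high_fever()
    return sum(P_DYSPNEA[p] * posterior[p] for p in (True, False))


def p_dyspnea_prior():
    """P(dyspnea = yes) with no evidence, for comparison."""
    return sum(P_DYSPNEA[p] * P_PNEUMONIA[p] for p in (True, False))


conditional = p_dyspnea_given_high_fever()
prior = p_dyspnea_prior()
```

With these made-up numbers, `conditional` is larger than `prior`: observing a high fever raises the probability of pneumonia, which in turn raises the probability of dyspnea, even though pneumonia itself is never observed.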