Mining Unstructured Medical Texts With Conformal Active Learning

Juliano Genari, Guilherme Tegoni Goedert

TL;DR

The paper tackles the high labeling burden and privacy concerns in extracting actionable signals from unstructured EHR text for epidemiological surveillance. It introduces Conformal Active Learning, a model-agnostic framework that blends active learning with label-conditional conformal prediction to deliver reliable, uncertainty-aware classifications while minimizing manual labeling. Key contributions include a clustering-based, diversity-aware labeling strategy, open-source code under GPLv3, and a lightweight deployment workflow (OLIM) suitable for on-site healthcare settings. Experiments on a proxy Amazon-review dataset show that strong performance can be achieved with as few as 200 labeled texts, with deep models not always providing advantages in resource-constrained environments, highlighting practical benefits for privacy-preserving real-time monitoring.
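To make the active learning idea concrete, the following is a minimal sketch of a pool-based loop with uncertainty sampling under a fixed labeling budget. It is not the paper's implementation: the one-dimensional toy model, the `oracle` callback standing in for the human specialist, and the names `fit_threshold` and `active_learning` are all assumptions made for illustration.

```python
def fit_threshold(labeled):
    """Toy model (an assumption): the decision threshold is the midpoint between
    the largest negative and the smallest positive labeled example.
    Requires at least one example of each class."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2.0


def active_learning(pool, oracle, seed_indices, budget):
    """Label at most `budget` items from `pool`, always querying the unlabeled
    item the current model is least certain about (closest to the threshold)."""
    # Start from a small seed set labeled by the oracle (hypothetical annotator).
    labeled = {i: oracle(pool[i]) for i in seed_indices}
    while len(labeled) < budget:
        threshold = fit_threshold([(pool[i], y) for i, y in labeled.items()])
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # Uncertainty sampling: distance to the threshold is the uncertainty proxy.
        query = min(candidates, key=lambda i: abs(pool[i] - threshold))
        labeled[query] = oracle(pool[query])
    final = fit_threshold([(pool[i], y) for i, y in labeled.items()])
    return final, labeled


if __name__ == "__main__":
    pool = [i / 100 for i in range(101)]
    threshold, labeled = active_learning(
        pool, oracle=lambda x: int(x >= 0.62), seed_indices=[0, 100], budget=12
    )
    print(threshold, len(labeled))
```

In this toy setting the loop behaves like a bisection search, so a dozen labels are enough to locate the boundary among 101 candidates. Real text classifiers would replace the threshold model and the distance-based uncertainty with their own scores.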

Abstract

The extraction of relevant data from Electronic Health Records (EHRs) is crucial to identifying symptoms and automating epidemiological surveillance processes. By harnessing the vast amount of unstructured text in EHRs, we can detect patterns that indicate the onset of disease outbreaks, enabling faster, more targeted public health responses. Our proposed framework provides a flexible and efficient solution for mining data from unstructured texts, significantly reducing the need for extensive manual labeling by specialists. Experiments show that our framework achieves strong performance with as few as 200 manually labeled texts, even for complex classification problems. Additionally, our approach can function with simple, lightweight models, achieving results that are competitive with, and occasionally better than, those of more resource-intensive deep learning models. This capability not only accelerates processing times but also preserves patient privacy, as the data can be processed on weaker on-site hardware rather than being transferred to external systems. Our methodology, therefore, offers a practical, scalable, and privacy-conscious approach to real-time epidemiological monitoring, equipping health institutions to respond rapidly and effectively to emerging health threats.

Paper Structure

This paper contains 18 sections, 1 theorem, 5 equations, 2 figures, 3 tables.

Key Result

Theorem 1

Suppose that the data $(X_1, Y_1), \ldots, (X_n, Y_n)$, along with a new data point $(X^\mathrm{new}, Y^\mathrm{new})$, are exchangeable random variables. Then, for any conformity score $s$, any $\alpha \in (0, 1)$, it holds that, for all label values $y$,

Figures (2)

  • Figure 1: Diagram of the active learning cycle.
  • Figure 2: Convergence of AUC-ROC for the Pet product label, using the XGBoost model, with only high uncertainty and with a 70/30 mix of high and low uncertainty.
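The two selection strategies compared in Figure 2 can be illustrated with a short sketch. It assumes that each unlabeled text already has a numeric uncertainty score and that "70/30" means 70% of each batch comes from the most uncertain items and 30% from the least uncertain ones. How the paper computes uncertainty and splits the batch is not stated here, so the function name `select_batch` and its parameters are hypothetical.

```python
def select_batch(uncertainty, batch_size, high_fraction=0.7):
    """Return pool indices to label next: the most uncertain items plus,
    when high_fraction < 1, a share of the least uncertain ones."""
    # Indices sorted from most to least uncertain.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    batch_size = min(batch_size, len(order))
    n_high = round(batch_size * high_fraction)
    n_low = batch_size - n_high
    high = order[:n_high]
    # Take the tail of the ranking for the low-uncertainty share.
    low = order[len(order) - n_low:] if n_low else []
    return high + low


if __name__ == "__main__":
    scores = [i / 20 for i in range(20)]  # toy uncertainty scores
    print(select_batch(scores, 10))                     # 70/30 mix
    print(select_batch(scores, 10, high_fraction=1.0))  # high uncertainty only
```

Setting `high_fraction=1.0` gives the pure high-uncertainty strategy, so both curves in such a comparison could be produced by the same routine.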

Theorems & Definitions (1)

  • Theorem 1
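
Theorem 1 concerns label-conditional conformal prediction under exchangeability. As a rough illustration, the sketch below builds per-label prediction sets in the split (calibration-set) setting, where the usual guarantee is that instances of each label are covered with probability at least 1 - α. The score used here (one minus the predicted probability of the candidate label), the function names, and the toy probabilities are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict


def class_thresholds(cal_probs, cal_labels, alpha):
    """Compute one score threshold per label from a held-out calibration set.
    cal_probs[i][y] is the model's predicted probability of label y for item i."""
    scores = defaultdict(list)
    for probs, y in zip(cal_probs, cal_labels):
        # Score of the true label only: calibration is done separately per label.
        scores[y].append(1.0 - probs[y])
    thresholds = {}
    for y, s in scores.items():
        s.sort()
        # Finite-sample corrected rank: ceil((n_y + 1) * (1 - alpha)).
        k = math.ceil((len(s) + 1) * (1.0 - alpha))
        # Too few calibration points for this label: the label is always included.
        thresholds[y] = s[k - 1] if k <= len(s) else math.inf
    return thresholds


def prediction_set(probs, thresholds):
    """Return every label whose score does not exceed that label's threshold."""
    return {y for y, q in thresholds.items() if 1.0 - probs[y] <= q}


if __name__ == "__main__":
    cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4],
                 [0.2, 0.8], [0.4, 0.6], [0.1, 0.9], [0.3, 0.7]]
    cal_labels = [0, 0, 0, 0, 1, 1, 1, 1]
    thresholds = class_thresholds(cal_probs, cal_labels, alpha=0.5)
    print(prediction_set([0.95, 0.05], thresholds))
```

A prediction set with a single label can be read as a confident classification, while an empty set or a set with several labels flags the text as uncertain, which is the kind of signal an active learning loop can use to decide what to send for manual labeling.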