Automating Weak Label Generation for Data Programming with Clinicians in the Loop

Jean Park, Sydney Pugh, Kaustubh Sridhar, Mengyu Liu, Navish Yarna, Ramneet Kaur, Souradeep Dutta, Elena Bernardis, Oleg Sokolsky, Insup Lee

TL;DR

This work tackles data scarcity in medical AI by introducing a clinician-in-the-loop, memory-driven weak-labeling framework for data programming. It replaces hard-to-specify labeling functions on high-dimensional data with distance-based prototypical labeling: the clinician labels a small set of memories, which induces labels for the rest of the data, and multiple such weak-label sets are fused via data programming. The proposed memory-generation algorithm (CLARANS-like) selects representative samples, guided by a distance threshold, and two domain-specific distance metrics (DTW for time series; CLIP-derived features with KL divergence or Euclidean distance for images) underlie the approach. Empirical results on 3,265 medical time-series alarms and 6,293 dermatology images show significant improvements over baselines (majority vote and Snuba), with particularly strong gains when using probability-distribution features and KL-based distances. The method reduces clinician labeling workload while delivering higher-quality labels, enabling more scalable medical AI workflows.
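The two distance families named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `dtw_distance` and `kl_divergence`, the absolute-difference step cost, and the `eps` smoothing are assumptions made for illustration, and the summary does not say whether the KL-based distance is symmetrized.

```python
import numpy as np

def dtw_distance(a, b):
    """Textbook dynamic time warping between two 1-D series.

    Uses an absolute-difference step cost (an assumption) and the
    standard O(len(a) * len(b)) dynamic program.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions.

    Entries are clipped at `eps` (an assumption) to avoid log(0), then
    renormalized. Note that KL divergence is not symmetric.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

DTW tolerates local stretching in time, so `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero even though the series differ in length, which is one reason it suits alarm-style time series.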

Abstract

Large Deep Neural Networks (DNNs) are often data hungry and need large amounts of high-quality labeled data for learning to converge. This is a challenge in medicine, where high-quality labeled data is often scarce. Data programming offers a way forward, since it allows us to label unlabeled data using multiple weak labeling functions. Such functions are often supplied by a domain expert. Data programming can combine multiple weak labeling functions and suggest labels better than simple majority voting over the different functions. However, it is not straightforward to express such weak labeling functions, especially in high-dimensional settings such as images and time-series data. In this paper we propose a way to bypass this issue using distance functions. In high-dimensional spaces, it is easier to find meaningful distance metrics that generalize across different labeling tasks. We propose an algorithm that queries an expert for labels of a few representative samples of the dataset. These samples are chosen by the algorithm to capture the distribution of the dataset. The labels assigned by the expert on the representative subset induce a labeling on the full dataset, thereby generating weak labels to be used in the data programming pipeline. In our medical time-series case study, labeling a subset of 50 to 130 out of 3,265 samples showed a 17-28% improvement in accuracy and a 13-28% improvement in F1 over the baseline using clinician-defined labeling functions. In our medical image case study, labeling a subset of about 50 to 120 images from 6,293 unlabeled medical images using our approach improved on the baseline method, Snuba, by approximately 5-15% in accuracy and 12-19% in F1 score.
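The label-induction step described in the abstract can be sketched as nearest-memory assignment. This is a minimal illustration under stated assumptions, not the paper's algorithm: the helper names `induce_weak_labels` and `majority_vote` are hypothetical, the memories are taken as given rather than selected by the paper's memory-generation procedure, and the fusion shown is the simple majority-vote baseline, standing in for a data programming tool.

```python
import numpy as np

def induce_weak_labels(samples, memories, memory_labels, distance):
    """Assign each sample the expert label of its nearest memory.

    `distance` is any callable taking two samples and returning a number.
    """
    labels = []
    for x in samples:
        dists = [distance(x, m) for m in memories]
        labels.append(memory_labels[int(np.argmin(dists))])
    return np.array(labels)

def majority_vote(weak_label_sets):
    """Baseline fusion: per-sample majority over several weak-label sets."""
    votes = np.stack(weak_label_sets)  # shape: (n_sets, n_samples)
    fused = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        fused.append(values[np.argmax(counts)])
    return np.array(fused)

# Toy data: four 1-D samples and three hypothetical memory sets.
euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
samples = np.array([[0.0], [0.1], [0.9], [1.0]])
set_a = induce_weak_labels(samples, np.array([[0.0], [1.0]]), [0, 1], euclidean)
set_b = induce_weak_labels(samples, np.array([[0.2], [0.8]]), [0, 1], euclidean)
set_c = induce_weak_labels(samples, np.array([[0.0], [1.0]]), [1, 1], euclidean)  # noisy set
fused = majority_vote([set_a, set_b, set_c])
```

Each memory set yields one weak-label vector over the full dataset, so a handful of expert labels per set is enough to label every sample. In this toy case, the two consistent sets outvote the noisy one.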

Paper Structure

This paper contains 23 sections, 5 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overall Approach: Starting from a dataset of unlabeled samples, we generate different partitions using different seeds. These partitions are centered around real prototypical samples from the dataset, referred to as memories. Next, an expert clinician assigns a label to each prototype. This induces labels on the full dataset. Finally, the memory-induced sets of weak labels are combined using a data-programming tool to arrive at better labels.
  • Figure 2: Medical image data examples. Sample dermatological images taken by patients for teledermatology consultation.
  • Figure 3: F1 and accuracy of the labels output by our approach for varying numbers of labeled samples ($N_l$) in the medical image case study.
  • Figure 4: F1 and accuracy of the labels output by our approach for varying numbers of labeled samples ($N_L$) in the medical time series case study.