Keeping Experts in the Loop: Expert-Guided Optimization for Clinical Data Classification using Large Language Models

Nader Karayanni, Aya Awwad, Chein-Lien Hsiao, Surish P Shanmugam

TL;DR

This work tackles the prompt-engineering bottleneck in healthcare LLM applications by introducing StructEase, an expert-in-the-loop framework for classifying unstructured clinical notes. It combines iterative prompt refinement with a principled data-sampling strategy (SamplEase) to maximize expert impact while minimizing labeling effort. On a NEISS-based helmet-status classification task, StructEase yields consistent improvements in macro-F1 and per-class metrics, surpassing baseline and DSPy approaches, with notable gains in the most challenging class and minimal bias across demographic groups. The approach emphasizes transparency, scalability, and practical deployment considerations, including open-source implementation and Dockerization, to enable broader adoption in healthcare AI workflows.

Abstract

Since the emergence of Large Language Models (LLMs), the challenge of effectively leveraging their potential in healthcare has taken center stage. A critical barrier to using LLMs for extracting insights from unstructured clinical notes lies in the prompt engineering process. Despite its pivotal role in determining task performance, a clear framework for prompt optimization remains absent. Current methods take one of two approaches: manual prompt refinement, in which domain experts collaborate with prompt engineers to create an optimal prompt, a process that is time-intensive and difficult to scale; or automatic prompt optimization, in which the value of domain experts' input is not fully realized. To address this, we propose StructEase, a novel framework that bridges the gap between automation and human expertise in prompt engineering. A core innovation of the framework is SamplEase, an iterative sampling algorithm that identifies high-value cases where expert feedback drives significant performance improvements. This targeted approach minimizes expert intervention, reduces labeling redundancy, mitigates human error, and enhances classification outcomes. We evaluated the performance of StructEase using a dataset of de-identified clinical narratives from the US National Electronic Injury Surveillance System (NEISS), demonstrating significant gains in classification performance compared to current methods. Our findings underscore the value of expert integration in LLM workflows, achieving notable improvements in F1 score while maintaining minimal expert effort. By combining transparency, flexibility, and scalability, StructEase lays the foundation for integrating expert input into LLM workflows in healthcare and beyond.

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: The proposed framework of StructEase
  • Figure 2: Aggregated performance metrics across iterative prompt refinements ($P_0$, $P_1$, $P_2$). (a) Overall macro-level metrics (Macro F1-Score, Macro Precision, and Macro Recall), showing consistent improvements with each iteration of the prompt. Error bars indicate 95% confidence intervals. (b) Per-class F1-Score for the "Helmet present," "No Helmet," and "Not mentioned" categories. Iterative refinements yielded performance improvements, particularly for the "No Helmet" class, which had the lowest initial performance at $P_0$.
  • Figure 3: Comparison of performance metrics between smart sampling and random sampling strategies. Each point represents a performance metric (Macro F1-Score, Macro Precision, Macro Recall) from six independent runs for each sampling method. Triangles indicate the median performance for each metric within the sampling method.
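
The macro-averaged metrics named in the figure captions can be illustrated with a minimal sketch. It uses the standard definition (per-class precision, recall, and F1, averaged with equal weight per class) and the three class names from Figure 2. The function names and the toy labels are hypothetical and are not taken from the paper's implementation or data.

```python
# Class names as they appear in the Figure 2 caption.
LABELS = ["Helmet present", "No Helmet", "Not mentioned"]


def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # A zero denominator yields 0.0 (an assumed convention for this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_scores(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class precision, recall, and F1."""
    rows = [per_class_prf(y_true, y_pred, lab) for lab in labels]
    n = len(labels)
    return tuple(sum(row[i] for row in rows) / n for i in range(3))


# Toy example (invented labels, for illustration only).
y_true = ["Helmet present", "No Helmet", "Not mentioned", "Not mentioned"]
y_pred = ["Helmet present", "Not mentioned", "Not mentioned", "Not mentioned"]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(f"macro P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
```

Because every class carries equal weight in the average, a class that is missed entirely (here "No Helmet") pulls the macro score down sharply regardless of how rare it is, which is why per-class results are worth reporting alongside the macro figures.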