ODD: A Benchmark Dataset for the Natural Language Processing based Opioid Related Aberrant Behavior Detection

Sunjae Kwon, Xun Wang, Weisong Liu, Emily Druhl, Minhee L. Sung, Joel I. Reisman, Wenjun Li, Robert D. Kerns, William Becker, Hong Yu

TL;DR

This work introduces ODD, a public benchmark dataset for ORAB detection in EHR notes, structured as a nine-label multi-label task spanning two aberrant behavior types and seven related opioid concepts. It compares fine-tuning and prompt-based fine-tuning using BioBERT variants, showing prompt-based methods yield stronger macro AUPRC, especially for rare categories, with a best macro AUPRC of 88.17. The dataset comprises 3,718 labeled instances across 2,840 sentences from 750 notes in MIMIC-IV, with strong inter-annotator agreement (κ = 0.86). The paper provides detailed error and socio-demographic analyses, reports substantial improvements with prompt-based tuning, and discusses data augmentation and ethical considerations for deploying ORAB detection in clinical settings. Overall, ODD offers a rigorous, domain-specific benchmark to drive advances in NLP for opioid-related risk assessment and abuse detection in healthcare records.

Abstract

Opioid-related aberrant behaviors (ORABs) are novel risk factors for opioid overdose. This paper introduces ODD (ORAB Detection Dataset), a new biomedical natural language processing benchmark dataset. ODD is an expert-annotated dataset designed to identify ORABs in patients' EHR notes and classify them into nine categories: 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed Opioid Dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing approaches (fine-tuning and prompt-tuning) to identify ORABs. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories, and that the gains were largest for the uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behavior, Diagnosed Opioid Dependency, and Medication Changes). Although the best model reached a macro-average area under the precision-recall curve of 88.17%, the uncommon classes still leave substantial room for improvement. ODD is publicly available.
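
The headline metric, the macro-average area under the precision-recall curve, can be illustrated with a small sketch. This is not the authors' evaluation code: it assumes the common step-wise (average precision) estimate of the area, it assumes each category is scored as an independent binary label, and the function names and toy scores are hypothetical. Because the macro average weights every category equally, a weak score on a rare category lowers the result as much as a weak score on a common one.

```python
from typing import Dict, Sequence, Tuple


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve
    for a single binary label. Assumes scores are not tied."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("undefined when the label has no positive examples")
    # Rank examples from the most to the least confident prediction.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            # Recall rises by 1/n_pos at each positive; add the precision
            # at this rank weighted by that recall step.
            ap += (true_pos / rank) / n_pos
    return ap


def macro_auprc(per_label: Dict[str, Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of the per-label scores, so rare and common
    categories contribute equally."""
    values = [average_precision(y, s) for y, s in per_label.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical toy predictions for two of the nine categories.
    toy = {
        "Opioids": ([1, 0, 1, 1, 0], [0.9, 0.2, 0.8, 0.7, 0.1]),
        "Suggested Aberrant Behavior": ([0, 1, 0, 0, 0], [0.6, 0.5, 0.1, 0.2, 0.3]),
    }
    for name, (y, s) in toy.items():
        print(name, average_precision(y, s))
    print("macro AUPRC:", macro_auprc(toy))
```

In the toy data the common label is ranked perfectly while the single positive of the rare label is ranked second, so the rare label halves its own score and pulls the macro average down accordingly.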

Paper Structure (40 sections, 2 figures, 11 tables)

This paper contains 40 sections, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Conceptual architectures of our ORAB detection models: (a) a fine-tuning model and (b) a prompt-based fine-tuning model. Here, $\textbf{x}$, $\textbf{y}$, and $\textbf{p}$ denote the input text, the output labels, and the prompt text, respectively. $\mathbf{h}_i$ is the hidden vector representation of the $i^{th}$ input token. The EHR text is inserted at '{text placeholder}', and the name of each category ($c_{1...n}$) in Table \ref{tab:label_examples} is inserted at '{$c_{1...n}$ placeholder}'.
  • Figure 2: A multi-label confusion matrix among categories. 'O' indicates that none of the categories applies.