NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts
Huu-Huy-Hoang Tran, Gia-Bao Duong, Quoc-Viet-Anh Tran, Thi-Hai-Yen Vuong, Hoang-Quynh Le
TL;DR
The paper addresses the challenge of extracting substance-use information from Spanish clinical texts in a low-resource setting. It proposes a multi-output ensemble system built on a BETO-based BERT-CRF architecture, with a sentence-filtering pre-processing step and a majority-voting ensemble, to jointly detect triggers (Tobacco, Cannabis, Alcohol, Drug) and contextual arguments (Type, Method, Amount, Frequency, Duration, History). Evaluated on the ToxHabits dataset of 1,499 Spanish case reports, the approach achieves a best Subtask 1 F1 of 0.94 (precision 0.97) and a best Subtask 2 F1 of 0.91 (precision ~0.95), with sentence filtering improving precision and overall robustness. The work demonstrates that careful architectural design and ensembling can yield strong performance in domain-specific, low-resource clinical NLP without relying on large language models, offering practical improvements for clinical decision support and public health surveillance in Spanish-speaking contexts.
Abstract
Extracting drug use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing. Although Large Language Models have advanced the field, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present the NOWJ submission to the ToxHabits Shared Task at BioCreative IX. This task targets the detection of toxic substance use and its contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system that tackles both Subtask 1 (ToxNER) and Subtask 2 (ToxUse). Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.
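As an illustration of the majority-voting idea named in the TL;DR, the following is a minimal sketch of token-level voting over the label sequences produced by several trained models. It is not the authors' implementation: the function name `majority_vote`, the BIO tag scheme, the fallback to "O" when no label wins a strict majority, and the example sentence and its labels are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_vote(
    runs: Sequence[Sequence[str]], min_votes: Optional[int] = None
) -> List[str]:
    """Combine per-token labels from several models by majority vote.

    Hypothetical helper: `runs` holds one label sequence per model,
    all aligned to the same tokens.
    """
    if not runs:
        raise ValueError("need at least one prediction sequence")
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must label the same tokens")

    # Default threshold: more than half of the models must agree.
    threshold = min_votes if min_votes is not None else len(runs) // 2 + 1

    merged = []
    for token_labels in zip(*runs):
        label, votes = Counter(token_labels).most_common(1)[0]
        # Assumed fallback: emit "O" (no entity) when no label reaches the threshold.
        merged.append(label if votes >= threshold else "O")
    return merged


# Illustrative labels for the tokens: Fumador de 20 cigarrillos al día
run_a = ["B-Tobacco", "O", "B-Amount", "I-Amount", "B-Frequency", "I-Frequency"]
run_b = ["B-Tobacco", "O", "B-Amount", "I-Amount", "O", "O"]
run_c = ["B-Tobacco", "O", "O", "I-Amount", "B-Frequency", "I-Frequency"]

print(majority_vote([run_a, run_b, run_c]))
```

Requiring a strict majority tends to favor precision over recall, since a span predicted by only one model is dropped. Voting token by token can produce an invalid tag sequence (for example an I- tag with no preceding B- tag), so a real system would likely vote on whole spans or repair the sequence afterwards.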
