Multi-objective Representation for Numbers in Clinical Narratives: A CamemBERT-Bio-Based Alternative to Large-Scale LLMs
Boammani Aser Lompo, Thanh-Dung Le
TL;DR
The paper tackles the challenge of understanding numerical values in clinical narratives, where traditional transformers struggle especially on small, imbalanced datasets. It introduces two strategies: (1) CamemBERT-bio with Label Embedding for Self-Attention (LESA) plus MLM prefinetuning, and (2) a multi-objective approach that combines LESA with Xval to capture both contextual and magnitude-based representations of numbers. The authors demonstrate that these methods yield significant improvements in eight physiological categories on CHUSJ data and can approach GPT-4 performance while remaining far more resource-efficient. The work offers a practical, privacy-conscious alternative for hospital-based clinical NLP tasks and provides a framework adaptable to other modalities beyond text.
Abstract
The processing of numerical values is a rapidly developing area in the field of Language Models (LLMs). Despite numerous advancements achieved by previous research, significant challenges persist, particularly within the healthcare domain. This paper investigates the limitations of Transformer models in understanding numerical values. \textit{Objective:} this research aims to categorize numerical values extracted from medical documents into eight specific physiological categories using CamemBERT-bio. \textit{Methods:} In a context where scalable methods and Large Language Models (LLMs) are emphasized, we explore lifting the limitations of transformer-based models. We examine two strategies: fine-tuning CamemBERT-bio on a small medical dataset, integrating Label Embedding for Self-Attention (LESA), and combining LESA with additional enhancement techniques such as Xval. Given that CamemBERT-bio is already pre-trained on a large medical dataset, the first approach aims to update its encoder with the newly added label embeddings technique. In contrast, the second approach seeks to develop multiple representations of numbers (contextual and magnitude-based) to achieve more robust number embeddings. \textit{Results:} As anticipated, fine-tuning the standard CamemBERT-bio on our small medical dataset did not improve F1 scores. However, significant improvements were observed with CamemBERT-bio + LESA, resulting in an over 13\% increase. Similar enhancements were noted when combining LESA with Xval, outperforming conventional methods and giving comparable results to GPT-4 \textit{Conclusions and Novelty:} This study introduces two innovative techniques for handling numerical data, which are also applicable to other modalities. We illustrate how these techniques can improve the performance of Transformer-based models, achieving more reliable classification results even with small datasets.
