Multi-objective Representation for Numbers in Clinical Narratives: A CamemBERT-Bio-Based Alternative to Large-Scale LLMs

Boammani Aser Lompo, Thanh-Dung Le

TL;DR

The paper tackles the challenge of understanding numerical values in clinical narratives, where traditional transformers struggle, especially on small, imbalanced datasets. It introduces two strategies: (1) CamemBERT-bio with Label Embedding for Self-Attention (LESA) plus MLM pre-fine-tuning, and (2) a multi-objective approach that combines LESA with Xval to capture both contextual and magnitude-based representations of numbers. The authors show that these methods significantly improve the classification of numbers into eight physiological categories on CHUSJ data and can approach GPT-4 performance while remaining far more resource-efficient. The work offers a practical, privacy-conscious alternative for hospital-based clinical NLP tasks and provides a framework adaptable to modalities beyond text.

Abstract

The processing of numerical values is a rapidly developing area in the field of Large Language Models (LLMs). Despite the advances achieved by previous research, significant challenges persist, particularly in the healthcare domain. This paper investigates the limitations of Transformer models in understanding numerical values. Objective: This research aims to categorize numerical values extracted from medical documents into eight specific physiological categories using CamemBERT-bio. Methods: In a context where scalable methods and LLMs are emphasized, we explore how to lift the limitations of transformer-based models. We examine two strategies: fine-tuning CamemBERT-bio on a small medical dataset while integrating Label Embedding for Self-Attention (LESA), and combining LESA with additional enhancement techniques such as Xval. Given that CamemBERT-bio is already pre-trained on a large medical dataset, the first approach aims to update its encoder with the newly added label embedding technique. In contrast, the second approach seeks to develop multiple representations of numbers (contextual and magnitude-based) to achieve more robust number embeddings. Results: As anticipated, fine-tuning the standard CamemBERT-bio on our small medical dataset did not improve F1 scores. However, significant improvements were observed with CamemBERT-bio + LESA, resulting in an increase of over 13%. Similar enhancements were noted when combining LESA with Xval, outperforming conventional methods and giving results comparable to GPT-4. Conclusions and Novelty: This study introduces two innovative techniques for handling numerical data, which are also applicable to other modalities. We illustrate how these techniques can improve the performance of Transformer-based models, achieving more reliable classification results even with small datasets.
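
To make the "magnitude-based" representation mentioned in the abstract more concrete, here is a minimal sketch of an xVal-style number embedding: every number in the text is replaced by a single placeholder token, and one shared vector is scaled by the numeric value. This is an illustration under stated assumptions, not the authors' implementation. The names `NUM_TOKEN`, `extract_numbers`, `magnitude_embeddings`, the `scale` normalization, the random stand-in for a learned vector, and the example sentence are all hypothetical.

```python
import re
import numpy as np

NUM_TOKEN = "[NUM]"  # hypothetical placeholder token for any number
# Assumption: numbers may use a French decimal comma ("92,5") or a dot ("92.5").
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def extract_numbers(text):
    """Replace each number with NUM_TOKEN and return the values in reading order."""
    values = [float(m.group().replace(",", ".")) for m in NUMBER_RE.finditer(text)]
    masked = NUMBER_RE.sub(NUM_TOKEN, text)
    return masked, values


def magnitude_embeddings(values, num_vector, scale=1.0):
    """Scale one shared [NUM] vector by each (normalized) value.

    The embedding direction is shared by all numbers; only its norm
    encodes the magnitude. `scale` is an assumed normalization constant.
    """
    column = np.asarray(values, dtype=np.float64).reshape(-1, 1)
    return (column / scale) * num_vector.reshape(1, -1)


rng = np.random.default_rng(0)
num_vector = rng.normal(size=8)  # stand-in for a learned [NUM] embedding

# Made-up example sentence (heart rate and oxygen saturation).
masked, values = extract_numbers("FC 120, saturation 92,5 %")
emb = magnitude_embeddings(values, num_vector, scale=100.0)
```

In a multi-objective setup such as the one the abstract describes, a vector like `emb[i]` would complement the contextual embedding that the encoder produces at the position of the i-th placeholder; how the two are combined is not specified in this section.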

Paper Structure (22 sections, 16 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: (a) Overview of the full distribution of numbers. (b) A focus on the distribution of small numbers. Non-integer numbers are well represented in the dataset.
  • Figure 2: Output of a completion task by CamemBERT-bio. The input sentence is "Patient en détresse respiratoire, gradient VG-VD ad $\langle$ mask $\rangle$ mmgh.". The figure contains the possible words to fill the mask with their corresponding probabilities.
  • Figure 3: Excerpt from [lompo2025mediumsizedtransformersmodelsrelevant]. An illustration of the Label Embedding for Self-Attention (LESA) is shown. The input to this Self-Attention layer consists of the token embeddings $[X_{CLS}, X_1, \cdots, X_L]$ and the keyword embeddings $[X^l_1, X^l_2, \cdots, X^l_n]$. This layer outputs the enhanced self-attention.
  • Figure 4: Brief illustration of the inference process using multiple representations.
  • Figure 5: The numbers predicted by Model 2. The $x$-axis represents the ground truth values, and the $y$-axis represents the predicted values. Both axes are displayed on a logarithmic scale.
  • ...and 2 more figures