Infusing clinical knowledge into tokenisers for language models

Abul Hasan; Jinge Wu; Quang Ngoc Nguyen; Salomé Andres; Imane Guellil; Huayu Zhang; Arlene Casey; Beatrice Alex; Bruce Guthrie; Honghan Wu

TL;DR

The paper treats tokenisation as a bottleneck in clinical language models and introduces K-Tokeniser, a knowledge-infused tokeniser that augments baseline vocabularies with semantic-type subwords derived from UMLS, MIMIC-III, or PubMed. It combines global semantic representations with local sentence context through Word Optimisation (entropy minimisation) and Sequence Optimisation (fertility-based switching), and it does not require pretraining. Across four clinical NLP tasks and three transformer-based language models, K-Tokeniser yields consistent performance gains, most notably a 13% increase in Micro $F_1$ on automated clinical coding. It also converges faster with less data: 50% of the training data suffices to match the baseline tokeniser's best result in concept extraction, and less than 20% in automated coding. The approach offers a generalisable way to incorporate domain knowledge into tokenisation and to improve clinical text analytics without costly pretraining.

Abstract

This study introduces a novel knowledge-enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at the initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology such as the Unified Medical Language System or the training data of the task-related corpus. At the training or inference stage, sentence-level localised context is used to choose the optimal global token representation and so realise the semantic-based tokenisation. To avoid pretraining with the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets to evaluate K-Tokeniser across a wide range of clinical text analytics tasks, including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task, with a 13% increase in Micro $F_1$ score. Furthermore, K-Tokeniser markedly speeds up the convergence of language models. Specifically, with K-Tokeniser the language models require only 50% of the training data to match the best performance of the baseline tokeniser trained on all the data in the concept extraction task, and less than 20% of the data in the automated coding task. Notably, none of these improvements requires pretraining, which makes the approach generalisable.
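
The abstract states that new tokens receive initial representations without pretraining, but it does not spell out the procedure. As an illustration only, the sketch below uses one common strategy: a new token's vector is the mean of the vectors of the subwords that the baseline vocabulary would split it into. This is an assumption, not necessarily the authors' method; the names `greedy_subwords`, `init_new_embedding`, and `toy_embeddings` are hypothetical, and the vectors are toy values.

```python
from typing import Dict, List


def greedy_subwords(word: str, vocab: set) -> List[str]:
    """Split a word by greedy longest-match against a vocabulary.

    Simplified WordPiece-style matching without '##' continuation markers.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry covers this position: treat the word as unknown.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


def init_new_embedding(token: str, embeddings: Dict[str, List[float]]) -> List[float]:
    """Initialise a new token as the mean of its baseline subword vectors.

    Assumption for illustration: mean pooling; the paper may use another rule.
    """
    pieces = greedy_subwords(token, set(embeddings))
    vectors = [embeddings[p] for p in pieces]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy baseline embedding table (hypothetical two-dimensional values).
toy_embeddings = {
    "hyper": [1.0, 0.0],
    "tension": [0.0, 1.0],
    "[UNK]": [0.0, 0.0],
}

new_vector = init_new_embedding("hypertension", toy_embeddings)
```

Here `hypertension` splits into `hyper` and `tension`, so the new vector is the average of those two rows. Starting from existing rows keeps the new token close to representations the model has already learned, which is why schemes of this kind can avoid a pretraining pass.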

Paper Structure

This paper contains 10 sections, 5 equations, 2 figures, 7 tables.

Introduction
Overview of Tokenisation and K-Tokeniser
Tokenisation Background
A Novel Tokenisation Framework - K-Tokeniser
Results
Tokeniser Evaluation Design and Datasets
Results on Clinical NLP Tasks
Effect of Variable Training Size
Discussion
Methods

Figures (2)

Figure 1: Analysis of fertility on the clinical concept extraction task using the n2c2 dataset
Figure 2: The optimisation steps performed by the K-Tokeniser at both the word and sentence levels
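
Fertility, the quantity analysed in Figure 1, is commonly defined as the average number of subword tokens produced per word, so lower values mean less fragmentation. The sketch below computes it and then applies a simple sentence-level rule that keeps whichever tokenisation has the lower fertility. The switching rule is an assumption made for illustration of "fertility-based switching"; the paper's exact criterion is not given on this page. The lookup tables and helper names are hypothetical.

```python
from typing import Callable, Dict, List

WordTokeniser = Callable[[str], List[str]]


def make_lookup_tokeniser(table: Dict[str, List[str]]) -> WordTokeniser:
    """Build a toy word tokeniser: known words use the table, others stay whole."""
    return lambda word: table.get(word, [word])


def fertility(sentence: str, tokenise_word: WordTokeniser) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = sentence.split()
    if not words:
        return 0.0
    return sum(len(tokenise_word(w)) for w in words) / len(words)


def pick_lower_fertility(
    sentence: str, baseline: WordTokeniser, candidate: WordTokeniser
) -> List[str]:
    """Tokenise with the candidate only if it fragments the sentence less.

    Assumed rule for illustration; ties fall back to the baseline tokeniser.
    """
    use_candidate = fertility(sentence, candidate) < fertility(sentence, baseline)
    chosen = candidate if use_candidate else baseline
    return [piece for word in sentence.split() for piece in chosen(word)]


# Hypothetical splits of a drug name under two vocabularies.
baseline_tok = make_lookup_tokeniser({"metoprolol": ["met", "op", "rol", "ol"]})
knowledge_tok = make_lookup_tokeniser({"metoprolol": ["meto", "prolol"]})

sentence = "metoprolol daily"
base_f = fertility(sentence, baseline_tok)    # (4 + 1) / 2
know_f = fertility(sentence, knowledge_tok)   # (2 + 1) / 2
tokens = pick_lower_fertility(sentence, baseline_tok, knowledge_tok)
```

In this toy case the baseline has fertility 2.5 and the knowledge-based split has 1.5, so the second tokenisation is kept for the whole sentence.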
