Infusing clinical knowledge into tokenisers for language models
Abul Hasan, Jinge Wu, Quang Ngoc Nguyen, Salomé Andres, Imane Guellil, Huayu Zhang, Arlene Casey, Beatrice Alex, Bruce Guthrie, Honghan Wu
TL;DR
The paper addresses the tokenisation bottleneck in clinical LMs by introducing K-Tokeniser, a knowledge-infused tokeniser that augments baseline vocabularies with semantic-type subwords derived from UMLS, MIMIC-III, or PubMed. It combines global semantic representations with local sentence context through Word Optimisation (entropy minimisation) and Sequence Optimisation (fertility-based switching), and it requires no pretraining. Across four clinical NLP tasks and multiple models, K-Tokeniser yields consistent performance gains (notably a Micro $F_1$ improvement of up to 13% in automated clinical coding) and converges faster with less training data. The approach offers a generalisable way to incorporate domain knowledge into tokenisation, improving the efficiency and effectiveness of clinical text analytics.
Abstract
This study introduces a novel knowledge-enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. At the initialisation stage, K-Tokeniser populates global representations of tokens based on the semantic types of domain concepts (such as drugs or diseases), drawn either from a domain ontology such as the Unified Medical Language System or from the training data of the task-related corpus. At the training or inference stage, sentence-level localised context is used to choose the optimal global token representation and thereby realise semantic-based tokenisation. To avoid pretraining with the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets to evaluate K-Tokeniser on a wide range of clinical text analytics tasks, including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task, with a 13% increase in Micro $F_1$ score. Furthermore, K-Tokeniser substantially speeds up the convergence of language models. Specifically, with K-Tokeniser the language models require only 50% of the training data in the concept extraction task, and less than 20% in the automated coding task, to match the best performance of the baseline tokeniser trained on all the training data. Notably, all these improvements require no pretraining, which makes the approach generalisable.
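
The abstract states that representations for new tokens are generated by an embedding initialisation approach, but it does not describe the procedure. As a rough illustration only, the sketch below shows one common heuristic for this kind of problem: a new token's vector is set to the mean of the vectors of the baseline subwords it would otherwise be split into. This heuristic is an assumption and may differ from the authors' method; the names `mean_init`, `toy_tokenise`, and `base_embeddings`, and the toy two-dimensional vectors, are hypothetical.

```python
from typing import Callable, Dict, List


def mean_init(
    new_token: str,
    base_tokenise: Callable[[str], List[str]],
    base_embeddings: Dict[str, List[float]],
) -> List[float]:
    """Initialise a new token as the mean of its baseline subword vectors.

    Assumption: mean pooling is an illustrative heuristic, not necessarily
    the initialisation used by K-Tokeniser.
    """
    # Split the new token with the baseline tokeniser.
    pieces = base_tokenise(new_token)
    # Keep only the pieces that have an embedding in the baseline model.
    vectors = [base_embeddings[p] for p in pieces if p in base_embeddings]
    if not vectors:
        raise ValueError(f"no baseline embedding found for {new_token!r}")
    dim = len(vectors[0])
    # Average the subword vectors dimension by dimension.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy baseline vocabulary with 2-dimensional embeddings (hypothetical values).
base_embeddings = {"met": [1.0, 0.0], "##formin": [0.0, 1.0]}


def toy_tokenise(word: str) -> List[str]:
    # Hypothetical stand-in for a baseline WordPiece-style tokeniser.
    return {"metformin": ["met", "##formin"]}.get(word, [word])


new_vector = mean_init("metformin", toy_tokenise, base_embeddings)
print(new_vector)  # [0.5, 0.5]
```

With an initialisation of this kind, the new token starts close to the region of embedding space that the pretrained model already associates with its subwords, which is one plausible reason an extended vocabulary could be used by fine-tuning alone.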
