Representation Learning of Structured Data for Medical Foundation Models

Vijay Prakash Dwivedi, Viktor Schlegel, Andy T. Liu, Thanh-Tung Nguyen, Abhinav Ramesh Kashyap, Jeng Wei, Wei-Hsian Yin, Stefan Winkler, Robby T. Tan

TL;DR

The UniStruct architecture is introduced to design a multimodal medical foundation model of unstructured text and structured data, which addresses the challenges LLMs face in processing medical codes due to the shortcomings of current tokenization methods.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across various domains, including healthcare. However, their ability to effectively represent structured non-textual data, such as the alphanumeric medical codes used in records like ICD-10 or SNOMED-CT, is limited, a weakness that recent research has exposed. This paper examines the challenges LLMs face in processing medical codes due to the shortcomings of current tokenization methods. To address them, we introduce the UniStruct architecture to design a multimodal medical foundation model of unstructured text and structured data, adapting subword tokenization techniques specifically for structured medical codes. Our approach is validated through model pre-training on both an extensive internal medical database and a public repository of structured medical records. Trained on over 1 billion tokens from the internal medical database, the proposed model achieves up to a 23% improvement in evaluation metrics, with around 2% gain attributed to our proposed tokenization. Additionally, when evaluated on the EHRSHOT public benchmark with a 1/1000 fraction of the pre-training data, the UniStruct model improves performance on over 42% of the downstream tasks. Our approach not only enhances the representation and generalization capabilities of patient-centric models but also bridges a critical gap in the ability of representation learning models to handle complex structured medical data alongside unstructured text.

Paper Structure

This paper contains 14 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: An example illustration of Byte Pair Encoding (BPE) for the input word aaabdaaabac, showing the steps that merge frequently co-occurring sub-words into new symbols (Z, Y, X).
  • Figure 2: Illustration of the tokenization adopted for structured history data. All available structured data in a database is first passed to a Byte Pair Encoding (BPE) trainer to generate a custom tokenizer, which is then used to encode a patient's visit timeline, illustrated here for a patient with 3 visits. The example shows that multiple codes which frequently occur together can be encoded as a single token ID; for instance, the codes 32147 32369 32906 are encoded as a single token with the token ID 1624.
  • Figure 3: Overview of the UniStruct architecture. The architecture incorporates both the text and the structured data modality. The latter represents a patient's past history, since structured data, such as medical codes, provide compressed, high-quality information about the patient's past visits. On datasets such as EHRSHOT (Wornow et al., 2023), where only structured data is available, we use the structured data pipeline only. Note that the 'current visit' data consists of only 'clinical text' and not structured data, since parts of the structured data are the targets of the model's predictions.
  • Figure 4: Department-wise evaluation on the new medical code assignment task on the internal dataset for the top 11 departments, which together cover more than two-thirds of all cases. The Baseline model denotes the text-only model without patient history, as in Table 1.
  • Figure 5: Comparison of different models on 14 EHRSHOT downstream tasks. The tasks are grouped into three categories: operational outcomes (left), anticipating lab test results (middle), and assignment of new diagnoses (right). The evaluation metric is AUROC, with higher scores indicating better performance. The proposed UniStruct model is shown in red, while the remaining models are baselines following Wornow et al. (2023).
  • ...and 1 more figure
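
The merge procedure described in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. This is not the authors' tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `bpe`), the placeholder symbols `T0`, `T1`, `T2`, and the code `40001` are assumptions made for illustration, and the sketch learns merges from a single toy sequence rather than from a whole database. Ties between equally frequent pairs are broken by first occurrence, so intermediate merges may differ from those drawn in Figure 1.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair, or None if no pair occurs twice."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]  # ties: first pair encountered wins
    return pair if freq > 1 else None


def merge_pair(seq, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair` (left to right)."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe(seq, new_symbols):
    """Greedily merge the most frequent pair, using one new symbol per merge."""
    seq = list(seq)
    merges = []
    for symbol in new_symbols:
        pair = most_frequent_pair(seq)
        if pair is None:  # nothing repeats any more, so stop early
            break
        seq = merge_pair(seq, pair, symbol)
        merges.append((pair, symbol))
    return seq, merges


# Character-level example from Figure 1.
word_tokens, word_merges = bpe("aaabdaaabac", ["Z", "Y", "X"])
print("".join(word_tokens))  # XdXac

# Code-level example in the spirit of Figure 2: whole codes are the base
# symbols. "40001" is a made-up code; T0..T2 are placeholder token names.
visits = ["32147", "32369", "32906", "40001", "32147", "32369", "32906"]
code_tokens, code_merges = bpe(visits, ["T0", "T1", "T2"])
print(code_tokens)  # ['T1', '40001', 'T1']
```

In the second example the recurring triple of codes collapses into one symbol after two merges, which mirrors how Figure 2 shows several frequently co-occurring codes sharing a single token ID.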