SNOBERT: A Benchmark for clinical notes entity linking in the SNOMED CT clinical terminology

Mikhail Kulyabin, Gleb Sokolov, Aleksandr Galaida, Andreas Maier, Tomas Arias-Vergara

TL;DR

SNOBERT tackles the challenging problem of linking unstructured clinical notes to SNOMED CT concepts by introducing a two-stage pipeline: a Candidate Selection stage that identifies potential mention spans with a domain-specific BERT model, followed by a Candidate Matching stage that aligns these spans to SNOMED CT concepts via embedding similarity and a reranker. Trained on a large MIMIC-IV-Note cohort and evaluated in the SNOMED CT Entity Linking Challenge, SNOBERT achieves competitive performance against baselines, with the best gains coming from robust candidate ranking and pretraining strategies. The work highlights zero-shot and long-tail issues in medical entity linking (EL) and suggests avenues such as end-to-end approaches and synthetic data from large language models to overcome annotation scarcity. Overall, SNOBERT provides a practical, scalable framework for automated clinical coding and structured data extraction from free-text notes.

Abstract

The extraction and analysis of insights from medical data, primarily stored in free-text formats by healthcare workers, presents significant challenges due to its unstructured nature. Medical coding, a crucial process in healthcare, remains minimally automated due to the complexity of medical ontologies and restricted access to medical texts for training Natural Language Processing models. In this paper, we propose a method, "SNOBERT," for linking text spans in clinical notes to specific concepts in SNOMED CT using BERT-based models. The method consists of two stages: candidate selection and candidate matching. The models were trained on one of the largest publicly available datasets of labeled clinical notes. SNOBERT outperforms other classical methods based on deep learning, as confirmed by the results of a challenge in which it was applied.
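
To make the candidate matching stage more concrete, the following is a minimal sketch of matching by embedding similarity. It is not the authors' implementation: the function names (`cosine`, `match_candidates`), the toy three-dimensional vectors, and the placeholder concept IDs `CONCEPT_A` and `CONCEPT_B` are assumptions made for illustration, and the reranker is left out. In practice the vectors would come from a mention encoder.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_candidates(
    mention_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (concept_id, similarity) pairs for one mention embedding."""
    scored = [(concept_id, cosine(mention_vec, vec)) for concept_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index of (concept ID, embedding) pairs. The vectors are invented;
# CONCEPT_A and CONCEPT_B are hypothetical placeholders, not SNOMED CT IDs.
TOY_INDEX = [
    ("235918000", [0.9, 0.1, 0.0]),
    ("CONCEPT_A", [0.0, 1.0, 0.0]),
    ("CONCEPT_B", [0.1, 0.0, 0.9]),
]

if __name__ == "__main__":
    # A made-up embedding for a mention extracted in the first stage.
    print(match_candidates([1.0, 0.2, 0.0], TOY_INDEX, top_k=2))
```

A reranker would then rescore only the short list returned here, which keeps the more expensive model off the full concept inventory.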

Paper Structure

This paper contains 12 sections, 4 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Example of a synthetic discharge note annotated according to SNOMED CT terminology, e.g. "blockage" corresponds to concept ID "235918000".
  • Figure 2: Distribution of the concepts in the annotated dataset: "long tail" distribution effect.
  • Figure 3: SNOBERT scheme. The method consists of two stages. In the Candidate Selection stage (I), the BERT model is used to classify the text's tokens into seven classes. In the Candidate Matching stage (II), the Mention Encoder matches the extracted embeddings from the training and testing datasets within these classes. A reranker then reorders the top matches and produces the final similarity score.
  • Figure 4: Training pipeline of the proposed approach. The method uses a two-stage solution: Candidate Selection and Candidate Matching. All the models were trained on the MIMIC-IV dataset. The model from the first stage was trained on the annotated training subset. An optional pretraining step was done on the full unlabeled dataset. Models were evaluated on the annotated test subset.
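
The Candidate Selection stage in Figure 3 classifies tokens into seven classes, and those per-token labels then have to be grouped into mention spans. The sketch below shows one way this grouping could work. The seven classes are not listed here, so the label set is an assumption: a BIO scheme over three hypothetical categories plus an outside label `O`. The names `LABELS` and `decode_spans` are likewise illustrative, not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical label set (assumption): BIO tags over three categories plus "O",
# giving seven classes in total. The paper's actual classes may differ.
LABELS = ["O", "B-FINDING", "I-FINDING", "B-PROCEDURE", "I-PROCEDURE", "B-BODY", "I-BODY"]


def decode_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, int, int, str]]:
    """Group BIO tags into (category, start, end_exclusive, text) spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[Tuple[str, int, int, str]] = []
    start = None
    category = None
    for i, tag in enumerate(tags):
        prefix, _, cat = tag.partition("-")
        # An I- tag extends the open span only if the category matches.
        if prefix == "I" and start is not None and cat == category:
            continue
        # Any other tag closes the span that is currently open, if any.
        if start is not None:
            spans.append((category, start, i, " ".join(tokens[start:i])))
            start, category = None, None
        # A B- tag opens a new span; "O" and stray I- tags open nothing.
        if prefix == "B":
            start, category = i, cat
    if start is not None:
        spans.append((category, start, len(tokens), " ".join(tokens[start:])))
    return spans


if __name__ == "__main__":
    tokens = ["bile", "duct", "blockage", "was", "noted"]
    tags = ["B-BODY", "I-BODY", "B-FINDING", "O", "O"]
    print(decode_spans(tokens, tags))
```

Each decoded span would then be embedded and passed to the Candidate Matching stage, with matching restricted to candidates of the same category.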