BioBridge: Unified Bio-Embedding with Bridging Modality in Code-Switched EMR

Jangyeong Jeon, Sangyeon Cho, Dongjoon Lee, Changhee Lee, Junyeong Kim

TL;DR

This work tackles the global challenge of pediatric emergency department overcrowding by leveraging NLP on code-switched Korean-English EMRs. It introduces BioBridge, a two-module framework combining bridging modality in context with unified bio-embedding to adapt encoder-based models to bilingual medical text and mediate the domain gap between general and clinical knowledge. Across a Korean-English EMR dataset, BioBridge consistently improves emergency vs. non-emergency classification over both ML baselines and standard encoder models, with notable gains in F1, AUROC, AUPRC, and Brier calibration, particularly for BioBridge-XLM and BioBridge-KR-BERT variants. The framework demonstrates robust performance across multilingual backbones and aims to inform real-time PED decision support, with public release planned to facilitate adoption and replication.

Abstract

Pediatric Emergency Department (PED) overcrowding presents a significant global challenge, prompting the need for efficient solutions. This paper introduces the BioBridge framework, a novel approach that applies Natural Language Processing (NLP) to Electronic Medical Records (EMRs) in written free-text form to enhance decision-making in the PED. In non-English-speaking countries, such as South Korea, EMR data is often written in a Code-Switching (CS) format that mixes the native language with English, with most code-switched English words having clinical significance. The BioBridge framework consists of two core modules: "bridging modality in context" and "unified bio-embedding." The "bridging modality in context" module improves the contextual understanding of bilingual and code-switched EMRs. In the "unified bio-embedding" module, the knowledge of a model trained in the medical domain is injected into the encoder-based model to bridge the gap between the medical and general domains. Experimental results demonstrate that the proposed BioBridge significantly outperforms traditional machine learning and pre-trained encoder-based models on several metrics, including F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Brier score. Specifically, BioBridge-XLM achieved enhancements of 0.85% in F1 score, 0.75% in AUROC, and 0.76% in AUPRC, along with a notable 3.04% decrease in the Brier score, demonstrating marked improvements in accuracy, reliability, and prediction calibration over the baseline XLM model. The source code will be made publicly available.
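Of the metrics listed in the abstract, the Brier score is the only one where lower is better, which is why the abstract reports a decrease. As a minimal sketch of the standard binary definition (the mean squared difference between predicted probability and the 0/1 label), the snippet below uses made-up labels and probabilities. The function name `brier_score` and the toy data are illustrative assumptions, not taken from the paper or its code.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels.

    Lower is better: 0.0 means perfectly confident, correct predictions.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (assumed for illustration): 1 = emergency, 0 = non-emergency.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
score = brier_score(labels, probs)
print(f"Brier score: {score:.4f}")
```

A model that is both accurate and well calibrated pushes this value toward zero, whereas a model that is accurate but overconfident on its mistakes is penalized quadratically.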

Paper Structure

This paper contains 25 sections, 7 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overview of BioBridge. (a) In the bridging modality in context module, tokenized text input $x^{tok}_{i}$ is processed to reconstruct ${x}_{i}^{bri} =\{\text{[CLS]},\text{[B-K]},\text{[tokens]}^{kor}, \text{[B-E]}, \text{[tokens]}^{eng}, \text{[SEP]}\}$, where the modalities for each language are separated using segment tokens: "[B-K]" and "[B-E]" distinctly mark the Korean and English spans, respectively. (b) In the unified bio-embedding module, English tokens within ${x}_{i}^{bri}$ are restructured at the word level into $\{x^{Eng}_{i}\}^{b}_{i=0}$, which then serve as input to $f^{B}_{\theta}(\{x^{Eng}_{i}\}^{b}_{i=0}) \in \mathbb{R}^{m \times h_{B}}$ to extract medical features. These features are subsequently passed to ${f}^{L}_{\theta}$, which maps them to the hidden dimension $h_\mathcal{M}$ of the pre-trained encoder $\mathcal{M}$.
  • Figure 2: Preprocessing Example of Present Illness (PI) Texts in an Electronic Medical Record (EMR).
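The reconstruction described in the Figure 1 caption can be sketched as a simple regrouping of tokens by language. The following is only an illustration under stated assumptions: the helper names `is_korean` and `bridge`, the Hangul-based language test, and the choice to place every non-Hangul token (digits, punctuation) on the English side are all hypothetical simplifications and are not confirmed by the paper. The sample sentence is invented.

```python
import re

# Assumption: a token is treated as Korean if it contains any Hangul syllable.
_HANGUL = re.compile(r"[\uac00-\ud7a3]")


def is_korean(token: str) -> bool:
    """Hypothetical language test based on the Hangul syllable block."""
    return bool(_HANGUL.search(token))


def bridge(tokens):
    """Regroup tokens as [CLS] [B-K] <Korean> [B-E] <English> [SEP].

    Relative order inside each language group is preserved.
    """
    korean = [t for t in tokens if is_korean(t)]
    english = [t for t in tokens if not is_korean(t)]
    return ["[CLS]", "[B-K]", *korean, "[B-E]", *english, "[SEP]"]


# Invented code-switched example ("fever" and "cough" are the English terms).
tokens = ["내원", "1일", "전부터", "fever", "있고", "cough", "동반"]
bridged = bridge(tokens)

# English span that would be handed to the medical-domain encoder.
start = bridged.index("[B-E]") + 1
english_words = bridged[start:-1]

print(bridged)
print(english_words)
```

In the caption's terms, `english_words` plays the role of the word-level English input to $f^{B}_{\theta}$; the subsequent projection by $f^{L}_{\theta}$ into the encoder's hidden dimension is not shown here.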