BioBridge: Unified Bio-Embedding with Bridging Modality in Code-Switched EMR
Jangyeong Jeon, Sangyeon Cho, Dongjoon Lee, Changhee Lee, Junyeong Kim
TL;DR
This work addresses pediatric emergency department (PED) overcrowding by applying NLP to code-switched Korean-English EMRs. It introduces BioBridge, a two-module framework that combines "bridging modality in context" with "unified bio-embedding" to adapt encoder-based models to bilingual medical text and to bridge the gap between general-domain and clinical knowledge. On a Korean-English EMR dataset, BioBridge improves emergency vs. non-emergency classification over both ML baselines and standard encoder models, with gains in F1, AUROC, AUPRC, and Brier score, particularly for the BioBridge-XLM and BioBridge-KR-BERT variants. The framework performs robustly across multilingual backbones and aims to inform real-time PED decision support; a public release is planned to facilitate adoption and replication.
Abstract
Pediatric Emergency Department (PED) overcrowding is a significant global challenge that calls for efficient solutions. This paper introduces the BioBridge framework, a novel approach that applies Natural Language Processing (NLP) to free-text Electronic Medical Records (EMRs) to support decision-making in the PED. In non-English-speaking countries such as South Korea, EMR data is often written in a Code-Switching (CS) format that mixes the native language with English, and most of the code-switched English words carry clinical significance. The BioBridge framework consists of two core modules: "bridging modality in context" and "unified bio-embedding." The "bridging modality in context" module improves the contextual understanding of bilingual, code-switched EMRs. The "unified bio-embedding" module injects the knowledge of a model trained in the medical domain into the encoder-based model to bridge the gap between the medical and general domains. Experimental results demonstrate that the proposed BioBridge significantly outperforms traditional machine learning and pre-trained encoder-based models on several metrics, including F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Brier score. Specifically, compared with the baseline XLM model, BioBridge-XLM improved F1 score by 0.85%, AUROC by 0.75%, and AUPRC by 0.76%, and reduced the Brier score by 3.04%, indicating gains in accuracy, reliability, and prediction calibration. The source code will be made publicly available.
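Since the Brier score is the one reported metric for which lower is better, a small illustration may help in reading the results. The following is a minimal sketch, not the authors' evaluation code: the helper names `brier_score` and `f1_at_threshold`, the 0.5 threshold, and the toy labels and probabilities are all assumptions made for illustration. It shows two hypothetical classifiers with the same F1 score but different Brier scores, which is why calibration is reported separately from classification accuracy.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1_at_threshold(y_true: Sequence[int], y_prob: Sequence[float],
                    threshold: float = 0.5) -> float:
    """F1 score of the positive class after thresholding the probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): 1 = emergency, 0 = non-emergency.
y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # hypothetical model with sharp probabilities
hesitant = [0.6, 0.4, 0.6, 0.4]   # hypothetical model with the same decisions

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    print(name,
          "F1 =", round(f1_at_threshold(y_true, probs), 3),
          "Brier =", round(brier_score(y_true, probs), 3))
```

Both toy models make identical decisions at the 0.5 threshold, so their F1 scores match, while the model whose probabilities sit closer to the true labels gets the lower Brier score.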
