Language Model Training Paradigms for Clinical Feature Embeddings
Yurong Hu, Manuel Burger, Gunnar Rätsch, Rita Kuznetsova
TL;DR
This work tackles the challenge of data-scarce clinical time series by learning universal clinical feature embeddings through self-supervised pretraining with continuous bag-of-words (CBOW) and masked language modeling (MLM) objectives. The authors demonstrate that these embeddings form a structured latent space aligned with clinical knowledge and can improve downstream predictions on MIMIC-III tasks, though gains relative to a strong FT-based baseline are nuanced. They validate the embeddings via unsupervised visualization and discuss interpretability advantages, suggesting that feature-level representations couple well with higher-level time-series models. The study highlights both the potential and the limitations of universal clinical feature embeddings and provides code for replication, paving the way for broader validation across tasks and datasets.
Abstract
In research areas with scarce data, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features, such as heart rate and blood pressure. We use self-supervised training paradigms for language models to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning. We visualize the learnt embeddings via unsupervised dimension reduction techniques and observe a high degree of consistency with prior clinical knowledge. We also evaluate the model performance on the MIMIC-III benchmark and demonstrate the effectiveness of using clinical feature embeddings. We publish our code online for replication.
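To make the two language-model training paradigms concrete, the following is a minimal sketch of how CBOW-style and MLM-style training examples could be constructed from clinical features. It is not the authors' implementation. It assumes that the features observed at one time step are treated like the tokens of a sentence; the feature names, the `[MASK]` token, and the helper functions `cbow_pairs` and `mlm_mask` are hypothetical and introduced only for illustration.

```python
import random

# Hypothetical feature names, used for illustration only.
OBSERVED = ["heart_rate", "systolic_bp", "spo2"]


def cbow_pairs(observed):
    """Build CBOW-style (context, target) pairs.

    Each observed feature becomes the prediction target once, and the
    other features observed alongside it form its context.
    """
    pairs = []
    for i, target in enumerate(observed):
        context = observed[:i] + observed[i + 1:]
        if context:  # a lone feature has no context to predict from
            pairs.append((context, target))
    return pairs


def mlm_mask(observed, mask_prob, rng, mask_token="[MASK]"):
    """Build one MLM-style training example.

    Each feature is replaced by mask_token with probability mask_prob.
    Returns the corrupted sequence and a dict mapping each masked
    position to the original feature the model should recover.
    """
    corrupted, labels = [], {}
    for pos, feat in enumerate(observed):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            labels[pos] = feat
        else:
            corrupted.append(feat)
    return corrupted, labels


if __name__ == "__main__":
    for context, target in cbow_pairs(OBSERVED):
        print(context, "->", target)
    # Assumed masking rate of 0.15, a common choice in MLM pretraining.
    print(mlm_mask(OBSERVED, mask_prob=0.15, rng=random.Random(0)))
```

In both cases an embedding table with one vector per clinical feature would be trained to solve the prediction task, and those vectors are what a downstream time-series model could reuse.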
