Unveiling Discrete Clues: Superior Healthcare Predictions for Rare Diseases

Chuang Zhao, Hui Tang, Jiheng Zhang, Xiaomeng Li

TL;DR

The paper addresses the challenge of predicting outcomes for rare diseases in electronic health records by bridging textual knowledge with co-occurrence signals in a unified discrete latent space. It introduces UDC, a tailored VQ-VAE framework with condition-aware and task-aware calibrations plus a co-teacher distillation strategy that aligns text and co-occurrence signals at the code level. The method proceeds in three stages: pretraining a collaborative model, learning the discrete Text-CO alignment, and fine-tuning with the DRL frozen. The authors report superior performance on three large healthcare datasets for both diagnosis prediction and medication recommendation, especially for rare diseases. The approach enriches representation semantics, enables bidirectional Text-CO transfer, and remains robust across diverse backbones and external knowledge sources, which makes it practical for clinical decision support. One stated limitation is that the method does not yet integrate modalities beyond text, which the authors leave for future work.

Abstract

Accurate healthcare prediction is essential for improving patient outcomes. Existing work primarily leverages advanced frameworks like attention or graph networks to capture the intricate collaborative (CO) signals in electronic health records. However, prediction for rare diseases remains challenging due to limited co-occurrence and inadequately tailored approaches. To address this issue, this paper proposes UDC, a novel method that unveils discrete clues to bridge consistent textual knowledge and CO signals within a unified semantic space, thereby enriching the representation semantics of rare diseases. Specifically, we focus on addressing two key sub-problems: (1) acquiring distinguishable discrete encodings for precise disease representation and (2) achieving semantic alignment between textual knowledge and the CO signals at the code level. For the first sub-problem, we refine the standard vector quantized process to include condition awareness. Additionally, we develop an advanced contrastive approach in the decoding stage, leveraging synthetic and mixed-domain targets as hard negatives to enrich the perceptibility of the reconstructed representation for downstream tasks. For the second sub-problem, we introduce a novel codebook update strategy using co-teacher distillation. This approach facilitates bidirectional supervision between textual knowledge and CO signals, thereby aligning semantically equivalent information in a shared discrete latent space. Extensive experiments on three datasets demonstrate our superiority.
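To make the two sub-problems more concrete, the following is a minimal sketch of (1) a plain nearest-codebook lookup, as used in standard vector quantization, and (2) a symmetric loss that pulls the soft code assignments of a text embedding and a CO embedding toward each other. It is not the authors' implementation: the condition-aware quantization and the co-teacher codebook update are not specified in this summary, so the symmetric cross-entropy below is only a generic stand-in for bidirectional supervision. All names (`quantize`, `soft_assign`, `co_teacher_loss`), the temperature parameter, and the toy vectors are hypothetical.

```python
import math

def quantize(vec, codebook):
    """Hard assignment: index and entry of the codebook vector nearest to vec (squared L2)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]

def soft_assign(vec, codebook, temperature=1.0):
    """Soft assignment: softmax over negative squared distances to each code."""
    logits = [-sum((v - c) ** 2 for v, c in zip(vec, code)) / temperature
              for code in codebook]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy of pred against a target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def co_teacher_loss(text_vec, co_vec, codebook, temperature=1.0):
    """Assumed stand-in for bidirectional supervision: each modality's
    soft code assignment serves as the target for the other."""
    p_text = soft_assign(text_vec, codebook, temperature)
    p_co = soft_assign(co_vec, codebook, temperature)
    return cross_entropy(p_text, p_co) + cross_entropy(p_co, p_text)

# Toy data (hypothetical): a 3-entry codebook in 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
text_vec = [0.9, 1.2]   # text embedding of a disease
co_vec = [1.1, 0.8]     # CO embedding of the same disease

idx_text, _ = quantize(text_vec, codebook)
idx_co, _ = quantize(co_vec, codebook)
aligned = co_teacher_loss(text_vec, co_vec, codebook)
misaligned = co_teacher_loss(text_vec, [3.9, 0.1], codebook)
print(idx_text, idx_co, aligned < misaligned)
```

In this toy setup both embeddings fall on the same code, and the loss is lower for that pair than for a pair whose CO embedding sits near a different code. That is the behaviour one would expect from any objective meant to align semantically equivalent text and CO information in a shared discrete space.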

Paper Structure

This paper contains 28 sections, 16 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a) Disease occurrences across three datasets. (b) Medication recommendation for commonest / rarest diseases.
  • Figure 2: Overview of UDC. We pre-train the PCM to establish a robust CO space and then obtain CO and text representations for diseases using PCM and a selected PLM. Next, we train the DRL to align the text and CO signals, followed by fine-tuning the PCM for downstream tasks while keeping the DRL frozen. Q, K, and V denote the parameters for multi-head attention.
  • Figure 3: Group Analysis.
  • Figure 4: Plug-in Application (Diverse PCM). We choose MoleRec, SHAPE, RAREMed, and SeqCare, as they are flexible to PCM.
  • Figure 5: Plug-in Application (Diverse PLM). We select HAR, GraphCare, and SeqCare that utilize external knowledge.
  • ...and 5 more figures