Revealing Interconnections between Diseases: from Statistical Methods to Large Language Models

Alina Ermilova, Dmitrii Kornilov, Sofia Samoilova, Ekaterina Laptenkova, Anastasia Kolesnikova, Ekaterina Podplutova, Senotrusova Sofya, Maksim G. Sharaev

TL;DR

This work tackles the problem of identifying clinically meaningful interconnections between diseases by systematically comparing approaches across two data modalities: ICD-10 code sequences from MIMIC-IV and ICD-10 code descriptions. It spans statistical baselines (Fisher's exact test, Jaccard), a masked language modeling (MLM) approach, medical-domain pretrained models (Med-BERT, BioClinicalBERT), text-based embeddings (BERT, Yandex Doc Search), and four LLMs (Mistral, DeepSeek, Qwen, YandexGPT). By constructing and comparing interconnection matrices and converting them to a disease ontology, the study finds that LLMs produce the least diverse connections and often diverge from domain-driven patterns, while MLM and Med-BERT closely align and text-based methods cluster by terminology. The results offer a ground-truth-free ontology for future clinical research and AI applications, while highlighting the limitations of LLMs for discovering new disease interconnections and underscoring the need for ground-truth datasets and broader population data.

Abstract

Identifying disease interconnections through manual analysis of large-scale clinical data is labor-intensive, subjective, and prone to expert disagreement. While machine learning (ML) shows promise, three critical challenges remain: (1) selecting optimal methods from the vast ML landscape, (2) determining whether real-world clinical data (e.g., electronic health records, EHRs) or structured disease descriptions yield more reliable insights, and (3) the lack of "ground truth," as some disease interconnections remain unexplored in medicine. Large language models (LLMs) demonstrate broad utility, yet they often lack specialized medical knowledge. To address these gaps, we conduct a systematic evaluation of seven approaches for uncovering disease relationships based on two data sources: (i) sequences of ICD-10 codes from MIMIC-IV EHRs and (ii) the full set of ICD-10 codes, both with and without textual descriptions. Our framework integrates the following: (i) a statistical co-occurrence analysis and a masked language modeling (MLM) approach using real clinical data; (ii) domain-specific BERT variants (Med-BERT and BioClinicalBERT); (iii) a general-purpose BERT and document retrieval; and (iv) four LLMs (Mistral, DeepSeek, Qwen, and YandexGPT). Our graph-based comparison of the resulting interconnection matrices shows that the LLM-based approach yields the lowest diversity of ICD code connections to different diseases among all methods, including the text-based and domain-specific approaches. This suggests an important implication: LLMs have limited potential for discovering new interconnections. In the absence of ground-truth databases for medical interconnections between ICD codes, our results constitute a valuable medical disease ontology that can serve as a foundational resource for future clinical research and artificial intelligence applications in healthcare.
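
As an illustration of the statistical co-occurrence analysis named above, the following is a minimal sketch, not the authors' implementation. It computes the Jaccard similarity and a one-sided Fisher's exact test p-value for a pair of ICD-10 codes. The toy admissions, the function names, and the choice of a one-sided test are all assumptions made for this example; the abstract does not specify how the paper builds its matrices.

```python
from math import comb

# Toy admissions: each set holds the ICD-10 codes recorded for one hospital stay.
# These records are invented for illustration; they are not MIMIC-IV data.
admissions = [
    {"I10", "E11"},
    {"I10", "E11", "N18"},
    {"I10"},
    {"E11", "N18"},
    {"J45"},
    {"I10", "E11"},
]


def contingency(a, b, records):
    """Return the 2x2 counts (both, only a, only b, neither) for codes a and b."""
    both = sum(1 for r in records if a in r and b in r)
    only_a = sum(1 for r in records if a in r and b not in r)
    only_b = sum(1 for r in records if b in r and a not in r)
    neither = len(records) - both - only_a - only_b
    return both, only_a, only_b, neither


def jaccard(a, b, records):
    """Admissions containing both codes divided by admissions containing either."""
    both, only_a, only_b, _ = contingency(a, b, records)
    union = both + only_a + only_b
    return both / union if union else 0.0


def fisher_one_sided(a, b, records):
    """P(co-occurrence count >= observed) under the hypergeometric null."""
    both, only_a, only_b, neither = contingency(a, b, records)
    n = both + only_a + only_b + neither
    with_a = both + only_a  # admissions containing code a
    with_b = both + only_b  # admissions containing code b
    total = comb(n, with_b)
    tail = sum(
        comb(with_a, k) * comb(n - with_a, with_b - k)
        for k in range(both, min(with_a, with_b) + 1)
    )
    return tail / total


print(jaccard("I10", "E11", admissions))
print(fisher_one_sided("I10", "E11", admissions))
```

Evaluating either function over every pair of codes fills a symmetric matrix, which is one plausible form for the interconnection matrices the abstract compares.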

Paper Structure

This paper contains 45 sections, 2 equations, 24 figures, 3 tables.

Figures (24)

  • Figure 1: Mean squared error (MSE) between the original Qwen model and its DeepSeek distillations.
  • Figure 2: Disease interconnection matrices of methods working with real data: statistical approaches (Fisher's exact test and Jaccard similarity) and MLM. For Fisher's exact test, we clip all elements above the $0.997$-quantile to the value of that quantile, which equals $2262$. We apply this clipping because the number of co-occurrences for Fisher's exact test ranges from $0$ to $91108$.
  • Figure 3: Disease interconnection matrices of methods pretrained on medical domain data: Med-BERT and BioClinicalBERT.
  • Figure 4: Disease interconnection matrices of methods working with ICD codes' textual descriptions: pretrained BERT and Yandex Doc Search.
  • Figure 5: Disease interconnection matrices of LLMs.
  • ...and 19 more figures
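
The quantile clipping described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names, the linear-interpolation quantile convention, and the toy matrix are assumptions. The toy example uses q = 0.5 so that clipping is visible on nine values; the caption reports q = 0.997.

```python
def quantile(values, q):
    """Linear-interpolation quantile; the paper's convention is not stated."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def clip_to_quantile(matrix, q=0.997):
    """Replace every entry above the q-quantile with the q-quantile value."""
    flat = [v for row in matrix for v in row]
    cap = quantile(flat, q)
    return [[min(v, cap) for v in row] for row in matrix], cap


# Toy co-occurrence counts with one extreme pair (values are illustrative only).
counts = [
    [0, 3, 5],
    [3, 0, 91108],
    [5, 91108, 0],
]

clipped, cap = clip_to_quantile(counts, q=0.5)
print(cap)
print(clipped)
```

Clipping of this kind keeps a few very large counts from compressing the color scale of a heatmap, so the structure among the smaller entries stays visible.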