Large scale statistically validated comorbidity networks

Paride Crisafulli, Tobias Galla, Antti Karlsson, Salvatore Miccichè, Jyrki Piilo, Rosario N. Mantegna

TL;DR

The paper studies comorbidity patterns by constructing large-scale networks from Finnish electronic health records, with diseases labeled at high resolution using ICD-10 level-4 codes. It builds a bipartite patient-disease network for each cohort, projects it onto a disease-only network (PROJ), and keeps only the links that pass a statistical test with false discovery rate (FDR) correction, which yields statistically validated networks (SVNs). Disease communities are detected with Infomap and organized into a hierarchical tree to reveal similarities across cohorts. Key findings: SVNs are sparser than the PROJ networks but retain substantial disease information; they contain multiple robust disease communities that differ by age and sex yet cluster into shared regions of the tree; and ICD categories are widely over-expressed within communities. A dismantling analysis highlights which disease categories contribute most to network cohesion and how this varies by cohort. The approach may inform healthcare policy and personalized medicine by revealing cohort-specific, interpretable groups of comorbidities and by identifying central disease categories for targeted interventions. Limitations include reliance on retrospective, region-specific data; future work could incorporate temporal dynamics and cross-population validation to generalize the findings.

Abstract

We obtain comorbidity networks starting from medical information stored in electronic health records collected by the Wellbeing Services County of Southwest Finland (Varha). Based on the data, we associate each patient with one or more diseases and construct complex comorbidity networks for large patient cohorts, each characterized by an age interval and sex. Diseases in the electronic health records are coded at the highest granularity available in the International Classification of Diseases (ICD codes) provided by the World Health Organization. We statistically validate links in each cohort comorbidity network and then partition the networks into communities of diseases. These communities are characterized by the over-expression of a few disease categories, and communities from different age or sex cohorts show various similarities in terms of these disease classes. Moreover, all the detected communities for all the cohorts can be organized into a hierarchical tree. This allows us to observe a number of clusters of communities, originating from diverse age and sex cohorts, that group together communities characterized by the same disease classes. We also perform a dismantling procedure on the statistically validated comorbidity networks to highlight the categories of diseases that are most responsible for the compactness of the comorbidity network of a given cohort of patients.
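
As a rough illustration of the link-validation step, the sketch below projects a toy patient-disease mapping onto disease pairs and keeps only the pairs that pass an FDR threshold. The choice of a hypergeometric null for co-occurrence, the Benjamini-Hochberg procedure, the threshold `alpha`, and the names `hypergeom_sf` and `validated_links` are all assumptions made for this example; the summary above states only that links are statistically validated with FDR.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(x, n_total, n_a, n_b):
    """P(X >= x) for the number of patients shared by two diseases that have
    n_a and n_b patients out of n_total, assuming random co-occurrence."""
    total = comb(n_total, n_b)
    upper = min(n_a, n_b)
    hits = sum(comb(n_a, k) * comb(n_total - n_a, n_b - k)
               for k in range(x, upper + 1))
    return hits / total


def validated_links(patient_diseases, alpha=0.01):
    """Return disease pairs whose co-occurrence passes a Benjamini-Hochberg
    FDR threshold (assumed procedure, see the lead-in)."""
    n_total = len(patient_diseases)

    # Disease -> set of patients: the bipartite network seen from the disease side.
    patients_of = {}
    for patient, diseases in patient_diseases.items():
        for disease in diseases:
            patients_of.setdefault(disease, set()).add(patient)

    # One p-value per pair of diseases sharing at least one patient (a PROJ link).
    tests = []
    for a, b in combinations(sorted(patients_of), 2):
        shared = len(patients_of[a] & patients_of[b])
        if shared > 0:
            p = hypergeom_sf(shared, n_total,
                             len(patients_of[a]), len(patients_of[b]))
            tests.append((p, a, b))

    # Benjamini-Hochberg: keep the k smallest p-values, where k is the
    # largest rank with p_(k) <= alpha * k / m.
    tests.sort()
    m = len(tests)
    cutoff = 0
    for rank, (p, _, _) in enumerate(tests, start=1):
        if p <= alpha * rank / m:
            cutoff = rank
    return [(a, b) for _, a, b in tests[:cutoff]]
```

Here the number of tests `m` is taken to be the number of links in the projected network, which is one possible convention. Pure-Python binomial sums are adequate for a toy example; cohorts of realistic size would call for a numerical library.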

Paper Structure

This paper contains 27 sections, 3 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Complementary cumulative distribution function of the degree of ICD nodes (top panels) and of patient nodes (bottom panels) in the bipartite network for all patient cohorts of females (left) and males (right).
  • Figure 2: Statistics of ICD category composition of SVNs for all cohorts of patients. ICD categories (columns) for different age cohorts (rows) and sexes (females in the upper panel and males in the lower panel). The number in brackets next to the age limits of the cohort is the total number of nodes in the SVN. The number inside each square is the number of nodes of the disease class (the letter in the ICD code) in the given network, while the color is proportional to the fraction of diseases that belong to that class.
  • Figure 3: Selected region of the average linkage hierarchical tree of the set of SVN communities with number of nodes larger than 25. The selected region contains the pair of ICD communities with the highest Jaccard similarity (c5_70-79_M and c11_60-69_M). All communities except one (c5_0-9_M) in this cluster present an over-expression of the ICD category H.
  • Figure 4: Selected region of the average linkage hierarchical tree with communities located at positions from 199 to 204. The over-expressed disease categories for each community, from left to right, are: F, FG, FG, F, F, FG.
  • Figure 5: Selected region of the average linkage hierarchical tree with communities located at positions from 291 to 305. The over-expressed disease categories for each community, from left to right, are: FYZ, FZ, FZ, FZ, FS, F, FZ, FZ, F, F, F, F, F, F, F.
  • ...and 10 more figures
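
The captions of Figures 3 to 5 refer to an average linkage hierarchical tree built over communities compared by Jaccard similarity. A minimal sketch of that idea follows, assuming the distance between two communities is one minus the Jaccard similarity of their ICD code sets. The function names, the community labels, and the toy code sets in the usage note are hypothetical and not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of ICD codes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_linkage(communities):
    """Naive agglomerative clustering with average linkage on 1 - Jaccard.

    `communities` maps a label to a set of ICD codes. Returns the merge
    history as (cluster_a, cluster_b, distance) tuples, where each cluster
    is a tuple of community labels.
    """
    labels = sorted(communities)
    dist = {
        frozenset((x, y)): 1.0 - jaccard(communities[x], communities[y])
        for x, y in combinations(labels, 2)
    }

    def linkage(pair):
        # Average linkage: mean pairwise distance between the members
        # of the two clusters.
        c1, c2 = pair
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        merges.append((c1, c2, linkage((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges
```

For example, calling `average_linkage({"a": {"H25.1", "H26.9", "H40.1"}, "b": {"H25.1", "H26.9", "H40.1", "H35.3"}, "c": {"F32.1", "F41.1"}})` merges `a` and `b` first, since they share three of four codes, and joins `c` only at distance 1. The loop is cubic in the number of communities, so a dedicated clustering library would be preferable for the several hundred communities the figure positions suggest.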