Community Detection in Multimodal Data: A Similarity Network Perspective

Aidan Marnane, T. Ian Simpson

TL;DR

The paper addresses how to construct reliable multi-modal similarity networks for community detection in biomedical data by introducing a synthetic data framework that systematically varies inter-modality consistency and data distributions. It evaluates five integration methods—Concatenated Features, Mean Similarity, Extreme Mean, SNF, and NEMO—across controlled scenarios and partial data, using network metrics and clustering performance to reveal method strengths and weaknesses. A key finding is that SNF and NEMO do not universally outperform simpler approaches like Mean $S_i$ or Concatenated $X_i$, especially in merged or merged-like cluster settings, while NEMO demonstrates superior robustness to partial modalities. The work provides practical guidance for method selection in multi-modal clustering and lays groundwork for extending similarity integration to more realistic, heterogeneous biomedical datasets with incomplete data.

Abstract

Similarity network construction is a fundamental step in many approaches to community detection in biomedical analysis. It is used both to create network structures from non-relational data and as a processing step in clustering pipelines. Any network analysis approach hinges on the quality of the underlying network. With the rising popularity of network learning and network-based clustering, constructing the network correctly is vital. Yet the underlying mechanisms of similarity network construction, particularly the implications of the choice of approach for multi-modal integration, remain poorly explored. By introducing differences in embedded cluster information and noise levels across modalities, we assess the performance of popular similarity integration techniques such as Similarity Network Fusion (SNF) and NEighborhood based Multi-Omics clustering (NEMO). Notably, SNF and NEMO fail to outperform simpler techniques such as mean similarity aggregation when incorporating modalities with inconsistently embedded clusters. We also demonstrate how integration methods can incorporate partial modalities, that is, datasets where not all individuals have a full set of measurements in all modalities. SNF shows significant sensitivity to incomplete modalities, while NEMO and mean aggregation are more resilient.
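
As a rough illustration of mean similarity aggregation with partial modalities, the sketch below averages each pair's similarity over only those modalities in which both individuals were measured. The Gaussian kernel, the `scale` parameter, the use of `None` for an unmeasured individual, and the function names are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def similarity_matrix(X, scale=1.0):
    """Pairwise Gaussian-kernel similarity (assumed kernel).

    X is a list of feature vectors; an entry of None marks an individual
    with no measurement in this modality, and yields None similarities.
    """
    n = len(X)
    S = [[None] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if X[a] is None or X[b] is None:
                continue
            d2 = sum((p - q) ** 2 for p, q in zip(X[a], X[b]))
            S[a][b] = math.exp(-d2 / (2 * scale ** 2))
    return S


def mean_similarity(similarities):
    """Average each pair over the modalities where both individuals are observed."""
    n = len(similarities[0])
    fused = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            observed = [S[a][b] for S in similarities if S[a][b] is not None]
            # Pairs never observed together get similarity 0 (assumed convention).
            fused[a][b] = sum(observed) / len(observed) if observed else 0.0
    return fused


# Two toy modalities; individual 2 is unmeasured in the second one.
X1 = [[0.0], [0.1], [5.0]]
X2 = [[1.0], [1.2], None]
S1, S2 = similarity_matrix(X1), similarity_matrix(X2)
fused = mean_similarity([S1, S2])
```

Here individual 2 has no measurement in the second modality, so its fused similarities come from the first modality alone, while the pair (0, 1) is averaged over both.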

Paper Structure

This paper contains 26 sections, 14 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Approaches to Similarity Integration in Multi-Modal Network Construction. Methods can be classified as early, intermediate or late integration techniques, depending on whether the modalities' (i) data features $X_i$, (ii) pairwise similarities $S_i$ or (iii) individual networks $G_i$ are integrated to construct a similarity network $G$ for the dataset.
  • Figure 2: Generation of Modality-Specific Clusters and Feature Distributions. This figure illustrates the components that can be adjusted when generating modality-specific clusters and features from the ground truth labels $y$. For each modality $i$, the modality ground truth clusters $y_i$ are derived by applying one of four transformations to $y$: (i) keeping $y_i$ identical to $y$, (ii) splitting clusters in $y$ into subclusters, (iii) merging clusters in $y$, or (iv) generating random, unrelated clusters. Features $X_i$ are then generated based on $y_i$ using one of three distributions: (i) a mixture of Gaussians, (ii) a mixture of Student's t-distributions, or (iii) categorical data.
  • Figure 3: Types of Partial Data in Multi-Modal Datasets. This figure illustrates two scenarios of partial data in multi-modal datasets: measurements missing either at random or based on cluster membership. When measurements are missing based on cluster, only individuals from cluster 1 (orange) lack measurements in modality 3 (light green). When data are partial at random, there is no link between the cluster label and the missing measurements.
  • Figure 4: AMI Performance Comparison of Similarity Integration Methods Across Multiple Modalities. AMI performance of the A) SBM, B) Leiden and C) Spectral clustering algorithms on 20 instances of 15 different modality problems, using Euclidean distance. Five similarity integration methods are compared: SNF, NEMO, Concatenated $X_i$, Mean $S_i$ and Extreme Mean. The average performance of each clustering algorithm on a KNN network $G_i$ built from each individual modality is also shown. All integration methods (including simple concatenation) provide a significant improvement in performance over the individual modalities. SNF is consistently outperformed by simpler integration methods such as Mean $S_i$ and NEMO on Leiden clustering. Both NEMO and SNF do offer improvements in the accuracy of SBM and Spectral clustering methods. A network constructed from simple concatenation matches the performance of more complex approaches on easier modality problems. However, in higher-noise settings such as Noisy and Mixed Noisy, assessing each modality independently (i.e. using Mean $S_i$, NEMO or SNF) provides an improvement across all clustering algorithms.
  • Figure 5: Mean and Maximum AMI Performance Comparison of Similarity Integration Methods Across Multiple Modalities. Mean and maximum clustering AMI performance on the networks of the five integration methods, on 20 instances of several modality problems. We select seven representative modality problems to summarise performance. On problems with multiple merged modalities (Single Merged, Merged, Mixed Noisy), Mean $S_i$ outperforms SNF and NEMO in both Max and Mean AMI. On Split, 1Rand and Mixed 1Rand, Mean $S_i$'s maximum performance is quite strong: it is close to SNF on all three, outperforming it on 1Rand. Yet its mean clustering performance is significantly worse, and this drop is more significant than SNF's corresponding drop on merged clusters.
  • ...and 4 more figures
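
The label transformations and Gaussian feature generation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's generator: the function names, the specific split and merge schemes (two subclusters per cluster, pairing adjacent cluster ids), and the `separation` and `noise` parameters are all hypothetical.

```python
import random


def make_modality_labels(y, mode, rng, n_random_clusters=3):
    """Derive modality-specific labels y_i from ground-truth labels y.

    mode is one of "identical", "split", "merge" or "random", matching
    the four transformations named in the Figure 2 caption.
    """
    if mode == "identical":
        return list(y)
    if mode == "split":
        # Cluster c becomes subclusters 2c and 2c + 1 (assumed scheme).
        return [2 * c + rng.randrange(2) for c in y]
    if mode == "merge":
        # Adjacent ids are paired: 0, 1 -> 0; 2, 3 -> 1; ... (assumed scheme).
        return [c // 2 for c in y]
    if mode == "random":
        # Labels drawn independently of y.
        return [rng.randrange(n_random_clusters) for _ in y]
    raise ValueError(f"unknown mode: {mode}")


def gaussian_features(y_i, n_features, rng, separation=4.0, noise=1.0):
    """Sample X_i from a mixture of Gaussians, one component per label in y_i."""
    centres = {
        c: [rng.gauss(0.0, separation) for _ in range(n_features)]
        for c in sorted(set(y_i))
    }
    return [[rng.gauss(mu, noise) for mu in centres[c]] for c in y_i]


rng = random.Random(0)
y = [i % 4 for i in range(40)]  # 4 ground-truth clusters, 10 individuals each
y_merged = make_modality_labels(y, "merge", rng)
y_split = make_modality_labels(y, "split", rng)
X_merged = gaussian_features(y_merged, n_features=5, rng=rng)
```

A modality built from `y_merged` can at best distinguish two groups, while one built from `y_split` carries finer structure than `y`; combining such modalities is the kind of inconsistency the integration methods are tested against.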