Matrix Factorization for Inferring Associations and Missing Links

Ryan Barron, Maksim E. Eren, Duc P. Truong, Cynthia Matuszek, James Wendelberger, Mary F. Dorn, Boian Alexandrov

TL;DR

The paper tackles missing link prediction by extending matrix-factorization methods with automatic rank determination and uncertainty quantification (UQ). It introduces Weighted, Boolean, and Recommender NMFk (with ensemble variants) and combines Boolean thresholding, Boolean perturbations, and UQ so that the model can abstain on uncertain predictions. The approach improves link-prediction performance on synthetic data and on five real-world protein-protein interaction (PPI) networks, where it is benchmarked against LMF and symmetric LMF, and it is available through the T-ELF Python library. The intended applications include biological networks and knowledge-graph reasoning.

Abstract

Missing link prediction is a method for network analysis, with applications in recommender systems, biology, social sciences, cybersecurity, information retrieval, and Artificial Intelligence (AI) reasoning in Knowledge Graphs. Missing link prediction identifies unseen but potentially existing connections in a network by analyzing the observed patterns and relationships. In proliferation detection, this supports efforts to identify and characterize attempts by state and non-state actors to acquire nuclear weapons or associated technology, a notoriously challenging but vital mission for global security. Dimensionality reduction techniques like Non-Negative Matrix Factorization (NMF) and Logistic Matrix Factorization (LMF) are effective but require selecting the matrix rank, that is, the number of hidden features $k$, to avoid over- or under-fitting. We introduce novel Weighted (WNMFk), Boolean (BNMFk), and Recommender (RNMFk) matrix factorization methods, along with ensemble variants incorporating logistic factorization, for link prediction. Our methods integrate automatic model determination for rank estimation, which evaluates stability and accuracy using a modified bootstrap methodology, and uncertainty quantification (UQ), which assesses prediction reliability under random perturbations. We incorporate Otsu threshold selection and k-means clustering for Boolean matrix factorization, comparing them to coordinate descent-based Boolean thresholding. Our experiments highlight the impact of the choice of rank $k$, evaluate model performance under varying test-set sizes, and demonstrate the benefits of UQ for reliable predictions using abstention. We validate our methods on three synthetic datasets (Boolean and uniformly distributed) and benchmark them against LMF and symmetric LMF (symLMF) on five real-world protein-protein interaction networks, showing improved prediction performance.

Paper Structure

This paper contains 35 sections, 52 equations, 14 figures, 3 tables, 1 algorithm.

Figures (14)

  • Figure 1: Sample silhouette and relative error graphs obtained from NMFk applied to 615 malware specimens (from 10.1145/3624567).
  • Figure 2: Dog Dataset - Four binary images are used as Boolean latent features to generate the synthetic data of shape $400 \times 16$.
  • Figure 3: Swimmer Dataset - Dataset of 16 swimmer images. The first and third rows are the images with real-valued intensities ranging from 0 to 19. The second and fourth rows display the Boolean versions obtained after applying Otsu thresholding. For our analysis, we use the Boolean versions of the dataset, represented as a matrix of size $1024 \times 256$.
  • Figure 4: Results for Dog Data across methods including $\text{BNMFk}_{\text{kmeans}}$, NMFk, and WNMFk. Boolean thresholding is not used for NMFk and WNMFk. The first row presents violin plots visualizing the rank $k$ predictions at different test-set sizes. The second row displays RMSE scores for the test set, demonstrating the missing link prediction performance. The results are reported for each rank $k$ on the x-axis, with the dark dashed vertical line in each column marking the true rank $k=4$.
  • Figure 5: Results for Dog Data across methods, including BNMFk, NMFk, and WNMFk, evaluated under different Boolean thresholding techniques. The Boolean thresholding techniques are denoted by the subscripts kmeans, otsu, and search (coordinate descent). The first row presents violin plots visualizing the rank $k$ predictions at different test-set sizes. The second row displays RMSE scores for the test set, demonstrating the missing link prediction performance. The results are reported for each rank $k$ on the x-axis, with the dark dashed vertical line in each column marking the true rank $k=4$.
  • ...and 9 more figures
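
Otsu thresholding, used above to obtain the Boolean Swimmer images and listed in the abstract as one of the Boolean thresholding options, can be sketched as follows. This is a generic histogram-based Otsu implementation and not the paper's code: the function names `otsu_threshold` and `booleanize`, the bin count, and the choice of a single global threshold for the whole matrix are assumptions.

```python
import numpy as np


def otsu_threshold(values, n_bins=256):
    """Return the histogram bin edge that maximizes between-class variance."""
    v = np.asarray(values, dtype=float).ravel()
    counts, edges = np.histogram(v, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = counts / counts.sum()

    w0 = np.cumsum(p)                 # probability mass at or below each bin
    w1 = 1.0 - w0                     # probability mass above each bin
    cum_mean = np.cumsum(p * centers)
    total_mean = cum_mean[-1]

    # Between-class variance for a split after each bin
    denom = w0 * w1
    valid = denom > 1e-12
    sigma_b = np.zeros(n_bins)
    sigma_b[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / denom[valid]

    # Use the upper edge of the chosen bin so that `v >= t` reproduces the split
    return float(edges[1:][np.argmax(sigma_b)])


def booleanize(X_hat):
    """Threshold a real-valued matrix into a Boolean one with a global Otsu cut."""
    return np.asarray(X_hat) >= otsu_threshold(X_hat)
```

Applied to a real-valued reconstruction such as the product of two non-negative factors, `booleanize` yields a Boolean prediction matrix. A per-row or per-factor threshold would be an equally plausible design, and the source excerpt does not say which granularity the authors use.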