Identification of Device Dependencies Using Link Prediction
Lukáš Sadlek, Martin Husák, Pavel Čeleda
TL;DR
The paper addresses the challenge of identifying device dependencies in large, dynamic networks using passively collected IP flows. It introduces a latent-graph, link-prediction approach that relies on time-constrained, directed random walks to generate IP-address embeddings, from which dependency embeddings are formed and used to train a dependency classifier. Key contributions include a novel constrained-walk embedding pipeline inspired by Node2Vec, the ability to detect multiple dependency types (DD, LR, RR) and transitive forms (TD, TD3), and an evaluation showing acceptable performance on cyber-defense and campus datasets with AUC around $0.63$–$0.74$ and AP around $0.74$–$0.88$. The method supports batch processing, scales to large data, and remains applicable under privacy-preserving or encrypted flows, offering a practical tool for risk analysis and network management.
Abstract
Devices in computer networks cannot work without essential network services provided by a limited number of devices. Identification of device dependencies determines whether a pair of IP addresses is a dependency, i.e., whether the host with the first IP address depends on the host with the second one. These dependencies cannot be identified manually in large and dynamically changing networks. Nevertheless, they are important because of possible unexpected failures, performance issues, and cascading effects. We address the identification of dependencies using a new approach based on graph-based machine learning. The approach is a form of link prediction based on a latent representation of the computer network's communication graph. It samples random walks over IP addresses that fulfill time conditions imposed on network dependencies. A neural network uses the constrained random walks to construct an IP address embedding, a space in which IP addresses that often appear close together in the same communication chain (i.e., random walk) are placed near each other. A dependency embedding is constructed by combining the embedding values of the two IP addresses and is used to train the resulting dependency classifier. We evaluated the approach using IP flow datasets from a controlled environment and a university campus network that contain evidence about dependencies. Evaluation of correctness and of the relationship to other approaches shows that the approach achieves acceptable performance. It can consider all types of dependencies simultaneously and is applicable to batch processing in operational conditions.
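The walk-sampling and embedding-combination steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the flow tuple layout, the `max_gap` time condition (each next flow must start no earlier than the previous one and at most `max_gap` seconds later), the function names, and the element-wise product used to combine two IP address embeddings are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_adjacency(flows):
    """Index flows by source IP so that walks follow the flow direction.

    Each flow is assumed to be a (source IP, destination IP, start time) tuple.
    """
    adjacency = defaultdict(list)
    for src, dst, start in flows:
        adjacency[src].append((dst, start))
    return adjacency


def constrained_walk(adjacency, start_ip, walk_length, max_gap, rng):
    """Sample one directed walk under an assumed time condition.

    A flow may extend the walk only if it starts no earlier than the
    previously followed flow and at most max_gap seconds after it.
    """
    walk = [start_ip]
    current, last_time = start_ip, None
    while len(walk) < walk_length:
        candidates = [
            (dst, t)
            for dst, t in adjacency.get(current, [])
            if last_time is None or 0 <= t - last_time <= max_gap
        ]
        if not candidates:
            break  # no flow satisfies the time condition, so the walk ends
        current, last_time = rng.choice(candidates)
        walk.append(current)
    return walk


def dependency_embedding(src_vector, dst_vector):
    """Combine two IP address embeddings element-wise (assumed operator)."""
    return [a * b for a, b in zip(src_vector, dst_vector)]


# Hypothetical flows: a client contacts a server, which contacts a resolver
# shortly afterwards; the third flow starts too late to belong to the chain.
flows = [
    ("10.0.0.5", "10.0.0.2", 0.0),
    ("10.0.0.2", "10.0.0.1", 0.4),
    ("10.0.0.2", "10.0.0.9", 50.0),
]
adjacency = build_adjacency(flows)
walk = constrained_walk(adjacency, "10.0.0.5", walk_length=5, max_gap=1.0,
                        rng=random.Random(0))
print(walk)  # ['10.0.0.5', '10.0.0.2', '10.0.0.1']
```

In a full pipeline, many such walks would be passed to a skip-gram style model to learn the IP address embedding, and the combined vectors of labeled pairs would serve as training input for the dependency classifier.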
