Cyber-Security Knowledge Graph Generation by Hierarchical Nonnegative Matrix Factorization

Ryan Barron, Maksim E. Eren, Manish Bhattarai, Selma Wanna, Nicholas Solovyev, Kim Rasmussen, Boian S. Alexandrov, Charles Nicholas, Cynthia Matuszek

TL;DR

This paper tackles the challenge of organizing the vast cybersecurity literature by building a domain-specific, multi-modal knowledge graph (KG) from unstructured text. It introduces HSNMFk-SPLIT, a hierarchical, semantic nonnegative matrix factorization framework with automatic topic-number estimation that jointly factorizes document, semantic, and category data, enabling scalable extraction of topics, keywords, and named entities. The method is demonstrated on a corpus of over 2 million arXiv papers, yielding 24 super-topics and a cyber-focused KG with thousands of nodes and edges that links documents, named entities, and topic structures (e.g., Adversarial ML) in Neo4j. The approach supports discovering emerging trends and targeted research areas in cybersecurity by enabling complex queries over both observable and latent KG components. Topic discovery and KG construction are guided by an explicit joint objective: $\min_{\mathbf{W,H,G,J}} \frac{1}{2} \|\mathbf{X}-\mathbf{WH}\|_{F}^{2} + \alpha \|\mathbf{S}-\mathbf{WG}\|_{F}^{2} + \beta \|\mathbf{C}-\mathbf{WJ}\|_{F}^{2}$. The work highlights practical impact for domain-specific knowledge management and literature exploration in cybersecurity.
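To make the joint objective in the TL;DR concrete, the following minimal sketch evaluates it for small toy matrices in plain Python. It only computes the value of the objective and is not the HSNMFk-SPLIT solver. Reading X, S, and C as the document, semantic, and category matrices is an assumption based on the order in which the TL;DR lists them; the helper names (`matmul`, `frob_sq`, `joint_objective`), the matrix shapes, and the toy values are all hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def frob_sq(a, b):
    """Squared Frobenius norm of the difference a - b."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def joint_objective(X, S, C, W, H, G, J, alpha, beta):
    """Evaluate 1/2 ||X - WH||_F^2 + alpha ||S - WG||_F^2 + beta ||C - WJ||_F^2."""
    return (
        0.5 * frob_sq(X, matmul(W, H))
        + alpha * frob_sq(S, matmul(W, G))
        + beta * frob_sq(C, matmul(W, J))
    )


# Toy nonnegative factors with k = 1 latent topic (illustrative values only).
W = [[1.0], [2.0]]        # shared factor, appears in all three terms
H = [[1.0, 0.0, 2.0]]
G = [[0.5, 1.0]]
J = [[1.0]]

X = matmul(W, H)                # reconstructed exactly, so this term is 0
S = [[0.5, 1.0], [1.0, 3.0]]    # differs from WG by 1 in a single entry
C = [[1.0], [0.0]]              # differs from WJ by 2 in a single entry

value = joint_objective(X, S, C, W, H, G, J, alpha=0.5, beta=0.25)
print(value)  # 0.5 * 0 + 0.5 * 1 + 0.25 * 4 = 1.5
```

Because W is shared by all three terms, lowering the error on S or C also constrains how X is reconstructed; the weights alpha and beta set how strongly the semantic and category terms pull on that shared factor.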

Abstract

Much of human knowledge in cybersecurity is encapsulated within the ever-growing volume of scientific papers. As this textual data continues to expand, the importance of document organization methods becomes increasingly crucial for extracting actionable insights hidden within large text datasets. Knowledge Graphs (KGs) serve as a means to store factual information in a structured manner, providing explicit, interpretable knowledge that includes domain-specific information from the cybersecurity scientific literature. One of the challenges in constructing a KG from scientific literature is the extraction of ontology from unstructured text. In this paper, we address this topic and introduce a method for building a multi-modal KG by extracting structured ontology from scientific papers. We demonstrate this concept in the cybersecurity domain. One modality of the KG represents observable information from the papers, such as the categories in which they were published or the authors. The second modality uncovers latent (hidden) patterns of text extracted through hierarchical and semantic non-negative matrix factorization (NMF), such as named entities, topics or clusters, and keywords. We illustrate this concept by consolidating more than two million scientific papers uploaded to arXiv into the cyber-domain, using hierarchical and semantic NMF, and by building a cyber-domain-specific KG.

Paper Structure

This paper contains 10 sections, 1 equation, and 3 figures.

Figures (3)

  • Figure 1: Word cloud of selected topics and their interpreted labels. The method is applied hierarchically three times, selecting a topic at each level to extract its sub-topics. Selected topics from each level are shown in each row of the figure.
  • Figure 2: Data pipeline, starting from arXiv data (1), followed by text cleaning (2), running HSNMFk-SPLIT for topics (3), extracting the keywords for the decomposition (4), extracting the named entities per document (5), and structurally aggregating the data into the knowledge graph (6). Images generated with DALL·E.
  • Figure 3: Distribution of the top ten document categories, based on the arXiv author assignments. Each topic corresponds to the word clouds from Figure 1. Selected topics from each level are shown in each row of the figure.
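The final aggregation step of the pipeline in Figure 2 can be illustrated with a small sketch that collects per-document records into nodes and edges. This is only an illustration under stated assumptions: the record fields, node labels (`Document`, `Category`, `Topic`, `Entity`), relation names, and sample data are hypothetical and are not taken from the paper's schema. It uses plain Python containers rather than Neo4j.

```python
from collections import defaultdict


def build_graph(documents):
    """Collect nodes and (source, relation, target) edges from document records."""
    nodes = defaultdict(set)  # node label -> set of node identifiers
    edges = set()             # (source, relation, target) triples
    for doc in documents:
        doc_id = doc["id"]
        nodes["Document"].add(doc_id)
        # Observable modality: metadata such as the publication categories.
        for category in doc["categories"]:
            nodes["Category"].add(category)
            edges.add((doc_id, "IN_CATEGORY", category))
        # Latent modality: topic assignment and extracted named entities.
        nodes["Topic"].add(doc["topic"])
        edges.add((doc_id, "HAS_TOPIC", doc["topic"]))
        for entity in doc["entities"]:
            nodes["Entity"].add(entity)
            edges.add((doc_id, "MENTIONS", entity))
    return nodes, edges


# Hypothetical sample records, for illustration only.
docs = [
    {"id": "doc-1", "categories": ["cs.CR"], "topic": "Adversarial ML",
     "entities": ["ExampleOrg"]},
    {"id": "doc-2", "categories": ["cs.CR", "cs.LG"], "topic": "Adversarial ML",
     "entities": []},
]

nodes, edges = build_graph(docs)
print(sorted(nodes["Category"]))
print(len(edges))
```

Storing nodes in sets means that a category, topic, or entity shared by several documents appears once, so documents become connected through it; this is what lets a query move from one paper to others with the same latent topic.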