A review of feature selection strategies utilizing graph data structures and knowledge graphs

Sisi Shao, Pedro Henrique Ribeiro, Christina Ramirez, Jason H. Moore

TL;DR

This paper reviews methodologies for feature selection (FS) within knowledge graphs (KGs), emphasizing their roles in enhancing machine learning model efficacy, hypothesis generation, and interpretability, and highlights the potential of multi-objective optimization and interdisciplinary collaboration in advancing KG FS.

Abstract

Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems. This paper reviews the methodologies for feature selection within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hypothesis generation, and interpretability. Through this comprehensive review, we aim to catalyze further innovation in feature selection for KGs, paving the way for more insightful, efficient, and interpretable analytical models across various domains. Our review shows the critical importance of scalability, accuracy, and interpretability in feature selection techniques, and we advocate integrating domain knowledge to refine the selection process. We highlight the potential of multi-objective optimization and interdisciplinary collaboration in advancing KG feature selection, underscoring the impact of such methodologies on precision medicine, among other fields. The paper concludes by charting future directions, including the development of scalable, dynamic feature selection algorithms and the integration of explainable AI principles to foster transparency and trust in KG-driven models.

Paper Structure (27 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: An integrated overview of KGs encompassing RDF structuring, ontological frameworks, and GDB management, illustrating the flow from data sources to semantic querying and storage. The figure shows how varied scholarly and scientific data sources, such as Google Scholar, PubMed, arXiv, and DrugBank, provide raw data inputs. These inputs are then semantically encoded via RDF, using triples that consist of subjects, predicates, and objects, alongside URIs that ensure the unique identification and integration of data entities across the KG. At the heart of the semantic structure are ontologies, exemplified here by the Unified Medical Language System (UMLS), which define the schema for the KG by outlining the essential relationships and attributes of the domain-specific entities. This ontology-based schema informs the organization and representation of knowledge within GDBs, such as Neo4j, which are specialized for storing and operationalizing the complex relational data of KGs. The central round-edged box showcases the role of query languages, with Cypher shown as an example of extracting information from GDBs through its intuitive syntax and pattern-matching capabilities. The query output is drawn as a network of nodes and edges, representing the interrelations and potential analytical insights derived from KGs. Each cluster within the network, designated as A, B, and C, represents a distinct subset or aspect of the graph database that has been queried.
  • Figure 2: A tiny ADKG (yellow node: AD; purple nodes: genes; green nodes: drugs) [alzheimersknowledgebase]. There are five instances of the "Chemical binds gene" relationship (light purple arrows), where a chemical interacts directly with a gene; six instances of the "Gene associates with disease" relationship (yellow arrows), representing genes that have an association with AD; one instance of the "Chemical decreases expression" relationship (dark green arrow), indicating a chemical that downregulates the expression of a gene; and one instance of "Gene regulates gene" (purple arrow), suggesting a regulatory interaction between two genes, PPARG and TPI1. More detailed information on genes and drugs is given in Appendix B.
  • Figure 3: Illustration of Inflammatory Response (pink node) as a Potential Confounder in the Association Between AD (left yellow node) and Depression (right yellow node). The diagram shows the shortest paths (through orange nodes) identified by Dijkstra's algorithm. The two green paths also connect inflammatory response with AD and Depression, but each is one unit longer than the corresponding orange path. Consequently, Dijkstra's algorithm selects the orange paths.
  • Figure 7: Own-Think KG Advantage over Traditional One-hot Encoding. Consider a dataset that includes information about various cities (Beijing, Shanghai, and Hong Kong), where each city is represented by non-numerical discrete features such as its name. In a traditional dataset, this name might be converted into a numerical form using techniques like one-hot encoding. However, this process strips the city's name of any contextual information about the city itself. Using a KG like the Own-Think KG, we can query additional information about each city to enrich the features, such as geographical, economic, demographic, and cultural features.
  • Figure 8: Demonstration of Non-numeric Discrete Features Enrichment and Selection by Own-Think KG. The figure includes enriched information for Beijing, Hong Kong, and Shanghai. For example, the additional features for Shanghai provided by the Own-Think KG (see the preceding Own-Think KG figure) detail Shanghai's population size, average temperature, latitude, longitude, and GDP. These features contribute to a richer, more nuanced profile of Shanghai than a one-hot encoding of each city, and offer additional insight into how each aspect of a city may relate to the analysis at hand.
  • ...and 2 more figures
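
The Figure 3 caption describes Dijkstra's algorithm preferring the orange paths over green paths that are one unit longer. A minimal Python sketch of that behaviour follows. The toy graph is an assumption made only for illustration: the intermediate node names (orange_a, green_a1, and so on) and the unit edge weights are hypothetical and are not taken from the paper's KG.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (distance, path) for the cheapest path from source to target."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from target to source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


def undirected(edges):
    """Build an adjacency dict with unit weights from (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, {})[v] = 1
        graph.setdefault(v, {})[u] = 1
    return graph


# Hypothetical toy graph: each orange route has 2 edges, each green route has 3.
TOY_EDGES = [
    ("inflammatory_response", "orange_a"), ("orange_a", "AD"),
    ("inflammatory_response", "green_a1"), ("green_a1", "green_a2"),
    ("green_a2", "AD"),
    ("inflammatory_response", "orange_d"), ("orange_d", "Depression"),
    ("inflammatory_response", "green_d1"), ("green_d1", "green_d2"),
    ("green_d2", "Depression"),
]

graph = undirected(TOY_EDGES)
dist_ad, path_ad = dijkstra(graph, "inflammatory_response", "AD")
dist_dep, path_dep = dijkstra(graph, "inflammatory_response", "Depression")
print(dist_ad, path_ad)
print(dist_dep, path_dep)
```

With unit weights, path length equals the number of edges, so the two-edge orange routes are returned and the three-edge green routes are discarded. On a real KG, edge weights would need to be chosen to reflect the relationship types, which this sketch does not attempt.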