Representation-Enhanced Neural Knowledge Integration with Application to Large-Scale Medical Ontology Learning

Suqi Liu, Tianxi Cai, Xiaoou Li

TL;DR

This work tackles learning large-scale biomedical knowledge graphs with many relation types by introducing RENKI, a framework that combines representation-learning initialization with embedding-based neural KG models and a weighted least squares objective. It provides nonasymptotic, finite-sample guarantees expressed through oracle inequalities tied to the pseudo-dimension of the score-function class, and it develops pseudo-dimension bounds for both fixed and trainable embeddings within IP-NKG and C-NKG architectures. The authors validate the theory via simulations and demonstrate a real-world medical KG application integrating pretrained language representations with KG links, achieving high AUCs across nine relation types and showing the value of weighting to handle heterogeneous relations. Overall, RENKI enables robust, scalable learning of complex biomedical knowledge graphs, with practical impact on data integration, disease understanding, and potential mitigation of language model hallucinations through structured knowledge priors.

Abstract

A large-scale knowledge graph enhances reproducibility in biomedical data discovery by providing a standardized, integrated framework that ensures consistent interpretation across diverse datasets. It improves generalizability by connecting data from various sources, enabling broader applicability of findings across different populations and conditions. Generating a reliable knowledge graph that leverages multi-source information from existing literature is challenging, however, especially with a large number of nodes and heterogeneous relations. In this paper, we propose a general, theoretically guaranteed statistical framework, called RENKI, to enable simultaneous learning of multiple relation types. RENKI generalizes various network models widely used in statistics and computer science. The proposed framework incorporates representation learning output into the initial entity embeddings of a neural network that approximates the score function for the knowledge graph, and it continues to train the model to fit observed facts. We prove nonasymptotic bounds for in-sample and out-of-sample weighted MSEs in relation to the pseudo-dimension of the knowledge graph function class. Additionally, we provide pseudo-dimensions for score functions based on multilayer neural networks with ReLU activation functions, in the scenarios where the embedding parameters are either fixed or trainable. Finally, we complement our theoretical results with numerical studies and apply the method to learn a comprehensive medical knowledge graph combining a pretrained language model representation with knowledge graph links observed in several medical ontologies. The experiments support our theoretical findings and demonstrate the effect of weighting in the presence of heterogeneous relations and the benefit of incorporating representation learning in nonparametric models.
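
To make the ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a score function that concatenates head and tail entity embeddings with a one-hot relation indicator and passes them through a single hidden ReLU layer, and a weighted mean squared error in which each triple carries a weight determined by its relation type. The function names, the one-hot relation encoding, the normalization by the sum of weights, and all dimensions are illustrative assumptions; the paper's own IP-NKG and C-NKG definitions are not reproduced in this summary.

```python
import numpy as np

def relu_score(head_emb, tail_emb, rel_onehot, W1, b1, w2, b2):
    """Hypothetical score function: one hidden ReLU layer on concatenated inputs."""
    x = np.concatenate([head_emb, tail_emb, rel_onehot], axis=1)
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU activation
    return hidden @ w2 + b2                # one scalar score per triple

def weighted_mse(scores, labels, rel_ids, rel_weights):
    """Weighted MSE; normalizing by the sum of weights is an assumption here."""
    w = rel_weights[rel_ids]               # look up each triple's relation weight
    return float(np.sum(w * (scores - labels) ** 2) / np.sum(w))

# Toy data: random vectors stand in for pretrained entity representations.
rng = np.random.default_rng(0)
n_entities, n_relations, dim, hidden_dim, n_triples = 50, 3, 8, 16, 200
entity_emb = rng.normal(size=(n_entities, dim))
heads = rng.integers(0, n_entities, size=n_triples)
tails = rng.integers(0, n_entities, size=n_triples)
rel_ids = rng.integers(0, n_relations, size=n_triples)
labels = rng.integers(0, 2, size=n_triples).astype(float)

# Randomly initialized network parameters (untrained, for illustration only).
W1 = rng.normal(scale=0.1, size=(2 * dim + n_relations, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=hidden_dim)
b2 = 0.0

scores = relu_score(entity_emb[heads], entity_emb[tails],
                    np.eye(n_relations)[rel_ids], W1, b1, w2, b2)
rel_weights = np.array([1.0, 2.0, 0.5])    # hypothetical per-relation weights
loss = weighted_mse(scores, labels, rel_ids, rel_weights)
print(loss)
```

In a training loop, `entity_emb` would either be held fixed or updated together with the network weights, which corresponds to the fixed and trainable embedding scenarios mentioned in the abstract.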

Paper Structure

This paper contains 25 sections, 10 theorems, 80 equations, 5 figures, 4 tables.

Key Result

Theorem 1

Let $p \coloneq \operatorname{Pdim}(\mathcal{H})$ denote the pseudo-dimension of the function class $\mathcal{H}$, and suppose $\hat{f}$ satisfies eq:approx_sol.

Figures (5)

  • Figure 1: Schematic diagrams of neural knowledge graph models. Red blocks represent trainable embedding parameters. Red lines represent trainable weights. Blue circles and blocks stand for values after the operations.
  • Figure 2: Effect of sample sizes. We report the mean of $10$ independent random runs and the error bars represent the standard deviation calculated from them.
  • Figure 3: Effect of initial embedding. We report the mean of $10$ independent random runs and the error bars represent the standard deviation calculated from them.
  • Figure 4: Classification error for different methods. The error bars are calculated from $10$ independent random runs.
  • Figure 5: Effect of weighting.

Theorems & Definitions (19)

  • Definition 1: Feed-forward network
  • Definition 2: Inner Product Neural Knowledge Graph Model (IP-NKG)
  • Definition 3: Concatenation Neural Knowledge Graph Model (C-NKG)
  • Remark 1
  • Theorem 1: Oracle inequalities
  • Remark 2
  • Lemma 2: Mixture of experts
  • Lemma 3: Pseudo-dimension with fixed embedding
  • Lemma 4: Pseudo-dimension with trainable embedding
  • Remark 3
  • ...and 9 more