An Ad-hoc graph node vector embedding algorithm for general knowledge graphs using Kinetica-Graph

B. Kaan Karamete, Eli Glaser

TL;DR

The paper tackles the problem of deriving fixed-dimension node embeddings for general knowledge graphs, where traditional uniform embeddings are ill-suited to variable graph structure. It introduces an ad-hoc embedding framework built from four sub-features (hop patterns, label indices, cluster indices via Recursive Spectral Bisection (RSB), and transitional probabilities) that are flattened into a 1D vector and weighted to reflect their contribution. A novel loss function compares node-pair embeddings to a ground-truth score formed from pairwise Jaccard similarity and label overlap, and a stochastic gradient descent procedure optimizes the sub-feature weights to minimize the average embedding error. The approach enables efficient vector-based similarity computations on knowledge graphs and supports practical downstream AI tasks, with an implementation available in Kinetica-Graph's Developer Edition for replication and deployment.

Abstract

This paper discusses how to generate general graph node embeddings from knowledge graph representations. The embedded space is composed of a number of sub-features that mimic both local affinity and remote structural relevance. These sub-feature dimensions are defined by several indicators that we speculate capture nodal similarities: hop-based topological patterns, the number of overlapping labels, the transitional probabilities (Markov-chain probabilities), and the cluster indices computed by our recursive spectral bisection (RSB) algorithm. These measures are flattened over a one-dimensional vector space into their respective sub-component ranges so that the entire set of vector similarity functions can be used to find similar nodes. Our novel loss function defines the error as the sum of pairwise squared differences between the assumed embeddings and the ground-truth estimates across a randomly selected sample of graph nodes. The ground truth is estimated as a combination of pairwise Jaccard similarity and the number of overlapping labels. Finally, we demonstrate a multivariate stochastic gradient descent (SGD) algorithm that computes the weighting factors among the sub-vector spaces to minimize the average error using random sampling.
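The pipeline summarized in the abstract (weighted sub-features, a ground truth built from Jaccard similarity and label overlap, a squared-error loss, and SGD over the weights) can be illustrated with a minimal sketch. It is not the Kinetica-Graph implementation. Several choices here are assumptions that the abstract does not specify: cosine similarity as the embedding comparison, an equal blend (`alpha=0.5`) of neighbor Jaccard and label Jaccard as the ground truth, a finite-difference gradient, and all function names and toy data.

```python
import math
import random


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ground_truth(u, v, neighbors, labels, alpha=0.5):
    """Assumed blend of neighbor Jaccard and normalized label overlap."""
    return (alpha * jaccard(neighbors[u], neighbors[v])
            + (1 - alpha) * jaccard(labels[u], labels[v]))


def embed(subfeatures, weights):
    """Scale each sub-feature range by its weight and flatten into one 1D vector."""
    return [w * x for w, s in zip(weights, subfeatures) for x in s]


def cosine(a, b):
    """Cosine similarity (an assumed choice of vector similarity function)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def loss(weights, pairs, feats, neighbors, labels):
    """Mean squared difference between embedding similarity and ground truth."""
    total = 0.0
    for u, v in pairs:
        sim = cosine(embed(feats[u], weights), embed(feats[v], weights))
        total += (sim - ground_truth(u, v, neighbors, labels)) ** 2
    return total / len(pairs)


def sgd(feats, neighbors, labels, steps=100, batch=3, lr=0.2, eps=1e-4, seed=0):
    """Update the sub-feature weights on a random sample of node pairs per step."""
    rng = random.Random(seed)
    nodes = sorted(feats)
    weights = [1.0] * len(feats[nodes[0]])
    for _ in range(steps):
        pairs = [tuple(rng.sample(nodes, 2)) for _ in range(batch)]
        base = loss(weights, pairs, feats, neighbors, labels)
        grad = []
        for i in range(len(weights)):
            bumped = list(weights)
            bumped[i] += eps
            # Forward finite difference stands in for an analytic gradient.
            grad.append((loss(bumped, pairs, feats, neighbors, labels) - base) / eps)
        # Clamp so that weights stay strictly positive.
        weights = [max(w - lr * g, 1e-6) for w, g in zip(weights, grad)]
    return weights


# Made-up toy graph: four nodes, four sub-features per node
# (hop pattern, label indicators, cluster index, transition probabilities).
neighbors = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
labels = {"a": {"chess", "MALE"}, "b": {"chess", "MALE"},
          "c": {"golf", "FEMALE"}, "d": {"chess", "FEMALE"}}
feats = {
    "a": [[2, 2, 1], [1, 1, 0, 0], [0], [0.5, 0.5]],
    "b": [[2, 2, 1], [1, 1, 0, 0], [0], [0.4, 0.6]],
    "c": [[3, 1, 0], [0, 0, 1, 1], [1], [0.2, 0.8]],
    "d": [[1, 2, 2], [1, 0, 0, 1], [1], [0.9, 0.1]],
}
learned = sgd(feats, neighbors, labels)
```

Because cosine similarity ignores overall scale, only the relative sizes of the learned weights matter in this sketch; a different similarity function would change that behavior.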

Paper Structure (12 sections, 6 equations, 17 figures, 1 table)

Figures (17)

  • Figure 1: The layout of the sub-features within the vector embedding space. The sub-features $s_{0..3,k..p}$ per node are, respectively, the hop-based topology pattern, the associated labels, the cluster index computed using the recursive spectral bisection (RSB) algorithm, and the transitional probabilities (Markov chain solver). The $k,m,n,p$ are the vector ranges of each sub-feature. Four weight factors $w_{0..3}$, which are used to minimize the average total loss per node, are multiplied with each value within the sub-range of the respective feature $s_{0..3}$ to produce the final embedded vector content per graph node.
  • Figure 2: The hop pattern of a node is defined by the number of forks and the number of nodes in each fork arm, shown with the respective colors per hop; e.g., the second hop, depicted in cyan, has two forks with two nodes in each fork arm, as shown in the array below.
  • Figure 3: Graph-SQL syntax for Kinetica-Graph's create/graph RESTful API. The nodes and edges components are specified explicitly, i.e., with constants instead of being read from table columns (except in simple examples, the usual way is to list the nodes/edges in a DB table or stream them in). The bottom shows the chess graph ontology using the label keys; all edges ($100 \%$) are between Gender and Interest labeled nodes via the Relation group edge label key.
  • Figure 4: 3D visualization of the graph generated by the call in Figure 3, with node/edge label associations, using d3's force layout.
  • Figure 5: The response of Kinetica-Graph's create/graph call depicting node-label associations as a relational DB table. E.g., Alex and Tom have two common labels, namely chess and MALE.
  • ...and 12 more figures
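
The sub-range layout described in the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption only: the sub-feature sizes standing in for $k,m,n,p$, the weight values, and the function names are hypothetical, not taken from the paper.

```python
from itertools import accumulate


def subfeature_ranges(lengths):
    """Return (start, stop) index pairs for consecutive sub-feature ranges."""
    stops = list(accumulate(lengths))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


def apply_weights(flat, lengths, weights):
    """Multiply every value in each sub-range by that sub-feature's weight."""
    if len(lengths) != len(weights) or sum(lengths) != len(flat):
        raise ValueError("lengths, weights, and vector size must agree")
    out = list(flat)
    for (start, stop), w in zip(subfeature_ranges(lengths), weights):
        for i in range(start, stop):
            out[i] = w * flat[i]
    return out


# Hypothetical sizes for the four sub-features s_0..s_3
# (hop pattern, labels, cluster index, transition probabilities).
lengths = [3, 2, 1, 2]
ranges = subfeature_ranges(lengths)

# Hypothetical unweighted vector and weights w_0..w_3.
weighted = apply_weights([1, 2, 3, 1, 0, 5, 0.25, 0.75], lengths, [0.5, 2.0, 1.0, 4.0])
```

Keeping the ranges explicit means a single weight per sub-feature can be retuned without recomputing the underlying sub-feature values.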