An Ad-hoc graph node vector embedding algorithm for general knowledge graphs using Kinetica-Graph
B. Kaan Karamete, Eli Glaser
TL;DR
The paper tackles the problem of deriving fixed-dimension node embeddings for general knowledge graphs, where traditional uniform embeddings are ill-suited for variable graph structure. It introduces an ad-hoc embedding framework built from four sub-features—hop-patterns, label indices, cluster indices via Recursive Spectral Bisection (RSB), and transitional probabilities—flattened into a 1D vector and weighted to reflect their contribution. A novel loss function compares node-pair embeddings to a ground-truth score formed by pairwise Jaccard similarity and label overlap, and a stochastic gradient descent procedure optimizes the sub-feature weights to minimize the average embedding error. The approach enables efficient vector-based similarity computations on knowledge graphs and supports practical downstream AI tasks, with implementation available in Kinetica-Graph's Developer Edition for replication and deployment.
Abstract
This paper discusses how to generate general graph node embeddings from knowledge graph representations. The embedding space is composed of a number of sub-features that mimic both local affinity and remote structural relevance. These sub-feature dimensions are defined by several indicators that we hypothesize capture nodal similarities: hop-based topological patterns, the number of overlapping labels, the transitional probabilities (Markov-chain probabilities), and the cluster indices computed by our recursive spectral bisection (RSB) algorithm. These measures are flattened into a one-dimensional vector, each occupying its own sub-component range, so that the entire set of vector similarity functions can be used to find similar nodes. Our novel loss function defines the error as the sum of pairwise squared differences between the assumed embeddings and the ground-truth estimates, taken over a randomly selected sample of graph nodes. The ground truth is estimated as a combination of pairwise Jaccard similarity and the number of overlapping labels. Finally, we demonstrate a multivariate stochastic gradient descent (SGD) algorithm that computes the weighting factors among the sub-vector spaces to minimize the average error using random sampling.
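As a rough illustration of the loss and the weight fitting described in the abstract, the following Python sketch fits one weight per sub-feature by stochastic gradient descent on the squared error over randomly sampled node pairs. It is not the Kinetica-Graph implementation. The function names, the mixing factor `alpha`, the use of label-set Jaccard as a normalized stand-in for the overlapping-label count, and the weighted-sum form of the predicted similarity are all assumptions made for this example.

```python
import random

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ground_truth(i, j, neighbors, labels, alpha=0.5):
    """Assumed ground truth: a mix of neighbor Jaccard and label overlap.

    alpha is a hypothetical mixing factor; label overlap is normalized
    here with Jaccard so both terms lie in [0, 1].
    """
    return (alpha * jaccard(neighbors[i], neighbors[j])
            + (1 - alpha) * jaccard(labels[i], labels[j]))

def predicted(weights, sims):
    """Assumed model: weighted sum of per-sub-feature similarities."""
    return sum(w * s for w, s in zip(weights, sims))

def fit_weights(pairs, sub_sims, truth, n_features,
                lr=0.1, steps=2000, seed=0):
    """SGD on the squared error, one randomly sampled node pair per step.

    pairs:    list of (i, j) node pairs
    sub_sims: dict mapping a pair to its list of sub-feature similarities
    truth:    dict mapping a pair to its ground-truth score
    """
    rng = random.Random(seed)
    weights = [1.0 / n_features] * n_features
    for _ in range(steps):
        pair = rng.choice(pairs)
        sims = sub_sims[pair]
        err = predicted(weights, sims) - truth[pair]
        # Gradient of err**2 with respect to w_k is 2 * err * sims[k].
        weights = [w - lr * 2.0 * err * s for w, s in zip(weights, sims)]
    return weights

def mean_squared_error(weights, pairs, sub_sims, truth):
    """Average squared error over the given node pairs."""
    total = sum((predicted(weights, sub_sims[p]) - truth[p]) ** 2
                for p in pairs)
    return total / len(pairs)
```

In this sketch the per-sub-feature similarities in `sub_sims` are taken as given; in practice they would come from comparing the corresponding ranges of two flattened embedding vectors with a vector similarity function.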
