The Structure and Dynamics of Knowledge Graphs, with Superficiality

Loïck Lhote; Béatrice Markhoff; Arnaud Soulet

TL;DR

The paper proposes a generative model of knowledge graphs in which superficiality regulates the balance of the global distribution of knowledge by determining the proportion of misdescribed entities. The model leads to a better understanding of formal knowledge acquisition and organization.

Abstract

Large knowledge graphs combine human knowledge garnered from projects ranging from academia and institutions to enterprises and crowdsourcing. Within such graphs, each relationship between two nodes represents a basic fact involving these two entities. The diversity of the semantics of relationships constitutes the richness of knowledge graphs, leading to the emergence of singular topologies, sometimes chaotic in appearance. However, this complex characteristic can be modeled in a simple way by introducing the concept of superficiality, which controls the overlap between relationships whose facts are generated independently. With this model, superficiality also regulates the balance of the global distribution of knowledge by determining the proportion of misdescribed entities. This is the first model for the structure and dynamics of knowledge graphs. It leads to a better understanding of formal knowledge acquisition and organization.

Paper Structure (20 sections, 26 equations, 5 figures, 3 tables)

Knowledge Graphs
Preferential Attachment and Multiplex Networks
Generative Model with Superficiality
Impact of Superficiality on Ignorance
Acknowledgments
Supplementary materials

Figures (5)

Figure 1: A very small sample of the Wikidata knowledge graph. The yellow nodes represent entities (like protein, Neurotrophin-3 or memory) and the arrows represent relationships (like instance of, field of work or biological process). A fact corresponds to a subject entity linked to an object entity, e.g., (Neurotrophin-3, biological process, memory). Note that in the complete knowledge graph, a node often has both incoming and outgoing links.
Figure 2: Comparison of real-world data (■) and data generated by our model (●) for BnF, ChEMBL and Wikidata, with multiplexity phenomena highlighted by red arrows. Dashed arrows point to phenomena explainable by multimodality, while solid arrows point to unexpected drops in probability density.
Figure 3: Generative model of a fact $\langle s, r_i, o \rangle$. The first step consists in randomly drawing the relationship of the fact proportionally to $\rho_r$. Then, the subject $s$ and the object $o$ are chosen by applying the boxed procedure. For each entity $s$ or $o$, there are 3 cases: (a) choose the entity with a preferential attachment mechanism (with probability $\beta_r$), (b) insert a new entity (with probability $(1 - \beta_r) \times \sigma$), or (c) choose an entity from the knowledge graph already described by another relationship (with probability $(1 - \beta_r) \times (1 - \sigma)$). At initialization, it is necessary to force the insertion of an entity for the relationship $r$ to allow case (a), and the insertion of new entities in the knowledge graph to allow case (c).
Figure 4: Comparison of real-world data (■) and data generated by the Barabási-Albert model (◆) and the Bollobás model (▲) for BnF, ChEMBL and Wikidata. These two simplex models have a total connectivity $P(k)$ that always decreases with $k$, making it impossible to reproduce real data.
Figure 5: Simulation of the impact of superficiality on the proportion of misdescribed entities. The left plot reports the probability density of one relationship with $\beta = 0.85$ and $\alpha = 1$ (❍). In the middle, the multiplexing of 25 relationships with these same characteristics is plotted for two extreme superficialities (●). For a low superficiality $\sigma = 0.05$, we observe a huge drop of the probability $P(k)$ between $k = 1$ and $k = 25$ due to an increasing probability density $P(r)$ (▲), consistent with Equation \ref{equ:Probability} ($\times$). For $\sigma = 0.95$, the probability density is much more regular because both distributions $P(k)$ and $P(r)$ are strictly decreasing. On the right-hand side, the proportion of misdescribed entities $P(r_e(t) \le 3)$ of the knowledge graph is very high whenever $\sigma \ge 0.5$ (red area), meaning that most of the entities are described by fewer than 3 relationships. This proportion converges to $1 - (1 - \sigma)^3$ when the number of relationships tends to infinity. Note that the two points correspond to the projections of the two simulations (a).
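
The three-case procedure described in the caption of Figure 3 can be sketched in a few lines of Python. This is only an illustrative reading of the caption, not the authors' implementation. The names `generate_facts`, `pick`, `relationship_counts` and `limit_misdescribed` are hypothetical. Two simplifications are assumptions of this sketch: subject and object ends of a relationship share one preferential-attachment pool, and case (c) draws uniformly among entities not yet described by the chosen relationship. The last helper simply evaluates the limit $1 - (1 - \sigma)^3$ quoted in the caption of Figure 5.

```python
import random


def generate_facts(n_facts, rho, beta, sigma, seed=0):
    """Generate (s, r, o) triples with the three-case procedure (sketch)."""
    rng = random.Random(seed)
    relations = list(rho)
    weights = [rho[r] for r in relations]
    # ends[r] holds one occurrence of an entity per fact end it has in r, so a
    # uniform draw from it is a degree-proportional (preferential) draw.
    ends = {r: [] for r in relations}
    members = {r: set() for r in relations}  # entities described by r
    n_entities = 0
    facts = []

    def pick(r):
        nonlocal n_entities
        u = rng.random()
        if u < beta[r] and ends[r]:
            # Case (a): preferential attachment inside relationship r.
            e = rng.choice(ends[r])
        else:
            # Assumption: case (c) only reuses entities not yet described by r.
            outside = [x for x in range(n_entities) if x not in members[r]]
            reuse = u >= beta[r] + (1 - beta[r]) * sigma
            if reuse and outside:
                # Case (c): entity already described by another relationship.
                e = rng.choice(outside)
            else:
                # Case (b), or forced insertion when (a)/(c) is impossible.
                e = n_entities
                n_entities += 1
        ends[r].append(e)
        members[r].add(e)
        return e

    for _ in range(n_facts):
        # Draw the relationship proportionally to rho_r.
        r = rng.choices(relations, weights=weights, k=1)[0]
        s = pick(r)
        o = pick(r)
        facts.append((s, r, o))
    return facts, members


def relationship_counts(members):
    """Number of distinct relationships describing each entity."""
    counts = {}
    for ents in members.values():
        for e in ents:
            counts[e] = counts.get(e, 0) + 1
    return counts


def limit_misdescribed(sigma):
    """Limit quoted in the Figure 5 caption: 1 - (1 - sigma)^3."""
    return 1 - (1 - sigma) ** 3


if __name__ == "__main__":
    # Hypothetical setting: 25 equally likely relationships, beta = 0.85.
    rels = [f"r{i}" for i in range(25)]
    rho = {r: 1.0 for r in rels}
    beta = {r: 0.85 for r in rels}
    for sigma in (0.05, 0.95):
        _, members = generate_facts(5000, rho, beta, sigma)
        counts = relationship_counts(members)
        share = sum(1 for c in counts.values() if c <= 3) / len(counts)
        print(sigma, round(share, 3), round(limit_misdescribed(sigma), 3))
```

Rebuilding the `outside` list on every draw keeps the sketch short but costs time linear in the number of entities; a larger simulation would need an incremental structure instead.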
