tmQM-RDF Dataset: a Knowledge Graph Representing Transition Metal Complexes

Luca Cibinel, Trond Linjordet, Johan Pensar, David Balcells, Riccardo De Bin, Basil Ell

TL;DR

This work introduces tmQM-RDF, a comprehensive RDF-based knowledge graph that unifies three tmQM datasets (tmQM, tmQMg, tmQMg-L) into a three-level hierarchy describing complete transition metal complexes, their ligands, and atomic graphs. The authors define an explicit RDF/RDFS vocabulary (TBox/ABox) to encode structural components, bonds, atoms, and ligand relationships, enabling large-scale, machine-readable curation of approximately 47,814 TMCs with ~534 million triples. They demonstrate a plausible TMC reconstruction task by mining frequent graph patterns, learning a Bayesian network-based score function, and evaluating top-k reconstruction accuracy on two 1600-TMC subpopulations, with early-TM datasets generally yielding stronger performance than late-TM datasets. The results highlight the utility of tmQM-RDF for querying, pattern-driven analysis, and data-driven TMC manipulation, suggesting significant potential for accelerated discovery and exploration in coordination chemistry through integrated, machine-readable data resources.

Abstract

Transition Metal Complexes (TMCs) have wide-ranging practical utility in chemistry, with possible applications that range from catalysis to medicinal chemistry. The study of TMCs and their properties is thus a field rich with potential, one in which machine learning and computational approaches can offer a substantial aid. For this reason, appropriate and accessible datasets, collecting a wide range of information, are required in order to facilitate the effective analysis and investigation of such compounds. This paper contributes to the data modelling effort via the introduction of the transition metal quantum mechanics RDF (tmQM-RDF) dataset, a knowledge graph constructed using the Resource Description Framework (RDF) vocabulary which collects rich and detailed descriptions of approximately 50k TMCs. These descriptions are both qualitative and quantitative in nature, encompassing the compositional nature of TMCs in terms of their constituting ligands, as well as the entirety of their molecular graphs. An example of the power of the proposed representation is presented, showcasing how the information available in tmQM-RDF can be exploited for TMC manipulation tasks, achieving promising performance even with relatively simple probabilistic models.

Paper Structure (66 sections, 1 theorem, 57 equations, 13 figures, 10 tables)

Key Result

Proposition A1

Let $p, q$ be two frequent patterns given a graph dataset $\mathcal{G}$. Then the following statements are equivalent:

Figures (13)

  • Figure 1: Statements about general concepts are collected in the TBox. Here the TBox of tmQM-RDF is visually represented. Nodes represent the available classes. Solid edges represent the available predicates, where the tail and the head of the edge represent domain and range restrictions for that predicate. Dashed edges represent class-related assertions, i.e. subclass relationships between classes or class assignments. Bold edges highlight the predicate rdf:type and its subproperties (notice that an edge can be both solid and bold or both dashed and bold). The symbol * is used as a placeholder for a sequence of characters, representing chemical element symbols (in tmAr:* and lgCr:MetalCentre_*), ligand ids (in lgLr:Ligand_*) or property names (in cmTp:*, lgLrp:*, tmAp:*, tmArp:* and tmBp:*).
  • Figure 2: (a) The tmQM-RDF KG is the result of the integration of tmQM, tmQMg and tmQMg-L. This scheme summarises how each dataset contributes to the final three-level representation. (b) A visual example of an ABox compliant with the TBox in Figure 1, showcasing how the data from the tmQM series can be represented. Nodes with neither background nor border represent either blank nodes (_:*) or literals ("*"). Nodes with a white background represent classes. The remaining nodes represent instances of classes. For the sake of readability, not all the features, or literal datatypes, are represented.
  • Figure 3: (a) The bar plot of the counts of the appearances of the metal centres found in tmQM-RDF. (b) The $95\%$ quantiles of the empirical distributions of the count of single ligand instances ($q_{0.95}^\mathrm{inst}$) and of the count of TMCs that contain at least one copy of each ligand ($q_{0.95}^\mathrm{copy}$), in each subpopulation of TMCs identified by the metal centre. Vertical lines are added for readability. (c) The bar plot of the counts of the appearances of the 10 most frequent ligands in tmQM-RDF. (d) A visual representation of the 10 most frequent ligands in tmQM-RDF.
  • Figure A1: (a) A simple set of facts, in the form of a set of triples written in Turtle syntax. These facts are stated using URIs (:*), blank nodes (_:*) and literals ("*"). (b) The equivalent directed labelled graph representation of the same set of facts.
  • Figure A2: The set of facts from Figure A1 is expanded by adding a TBox. The TBox defines several classes that clarify the role and nature of the entities in the ABox, while also specifying appropriate domain and range constraints for the predicates.
  • ...and 8 more figures
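
The captions of Figures A1 and A2 describe facts stored as triples (an ABox) and a TBox that attaches domain and range restrictions to predicates. A minimal sketch of that idea in plain Python follows. Every name in it (the `ex:` prefix, `ex:hasLigand`, `ex:hasMetalCentre`, the helper functions) is a hypothetical illustration and is not taken from the tmQM-RDF vocabulary. Treating domain and range as constraints to check is also a simplification: under RDFS semantics they license inferring class membership rather than rejecting triples.

```python
# Toy vocabulary for illustration only; these are NOT tmQM-RDF identifiers.
RDF_TYPE = "rdf:type"

# TBox: for each predicate, the (domain class, range class) it is declared with.
TBOX = {
    "ex:hasLigand": ("ex:Complex", "ex:Ligand"),
    "ex:hasMetalCentre": ("ex:Complex", "ex:MetalCentre"),
}

# ABox: facts as (subject, predicate, object) triples.
# "_:l1" mimics a blank node; the "ex:" terms mimic URIs.
ABOX = [
    ("ex:c1", RDF_TYPE, "ex:Complex"),
    ("_:l1", RDF_TYPE, "ex:Ligand"),
    ("ex:Fe", RDF_TYPE, "ex:MetalCentre"),
    ("ex:c1", "ex:hasLigand", "_:l1"),
    ("ex:c1", "ex:hasMetalCentre", "ex:Fe"),
    ("ex:c1", "ex:hasLigand", "ex:Fe"),  # object is not typed as a ligand
]


def classes_of(node, triples):
    """Return the set of classes explicitly assigned to a node via rdf:type."""
    return {o for s, p, o in triples if s == node and p == RDF_TYPE}


def violations(triples, tbox):
    """List triples whose subject or object lacks the declared domain or range class."""
    bad = []
    for s, p, o in triples:
        if p not in tbox:
            continue  # rdf:type and undeclared predicates are not checked
        domain, range_ = tbox[p]
        if domain not in classes_of(s, triples) or range_ not in classes_of(o, triples):
            bad.append((s, p, o))
    return bad


if __name__ == "__main__":
    for triple in violations(ABOX, TBOX):
        print("inconsistent with TBox:", triple)
```

Representing the graph as a flat list of tuples keeps the sketch dependency-free; a real workflow would more likely load the data into an RDF library or triple store and query it with SPARQL.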

Theorems & Definitions (6)

  • Example A1
  • Example A2
  • Proposition A1
  • proof
  • Example A1
  • Example A1