tmQM-RDF Dataset: a Knowledge Graph Representing Transition Metal Complexes
Luca Cibinel, Trond Linjordet, Johan Pensar, David Balcells, Riccardo De Bin, Basil Ell
TL;DR
This work introduces tmQM-RDF, an RDF-based knowledge graph that unifies three tmQM datasets (tmQM, tmQMg, tmQMg-L) into a three-level hierarchy describing complete transition metal complexes (TMCs), their ligands, and their atomic graphs. The authors define an explicit RDF/RDFS vocabulary (TBox/ABox) to encode structural components, bonds, atoms, and ligand relationships, enabling large-scale, machine-readable curation of approximately 47,814 TMCs in about 534 million triples. They demonstrate a plausible TMC reconstruction task by mining frequent graph patterns, learning a Bayesian network-based score function, and evaluating top-k reconstruction accuracy on two 1600-TMC subpopulations, with early-TM data generally yielding stronger performance than late-TM data. The results illustrate how tmQM-RDF supports querying, pattern-driven analysis, and data-driven TMC manipulation, and suggest that integrated, machine-readable data resources could accelerate discovery and exploration in coordination chemistry.
Abstract
Transition Metal Complexes (TMCs) have wide-ranging practical utility in chemistry, with applications ranging from catalysis to medicinal chemistry. The study of TMCs and their properties is thus a field rich with potential, one in which machine learning and computational approaches can offer substantial aid. Effective analysis and investigation of such compounds therefore requires appropriate and accessible datasets that collect a wide range of information. This paper contributes to the data modelling effort by introducing the transition metal quantum mechanics RDF (tmQM-RDF) dataset, a knowledge graph built on the Resource Description Framework (RDF) that collects rich and detailed descriptions of approximately 50k TMCs. These descriptions are both qualitative and quantitative, covering the composition of TMCs in terms of their constituent ligands as well as their complete molecular graphs. An example of the power of the proposed representation is presented, showing how the information available in tmQM-RDF can be exploited for TMC manipulation tasks, achieving promising performance even with relatively simple probabilistic models.
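
To make the layered description concrete, the following is a minimal sketch of how a TMC, its ligands, and their atoms might be expressed as subject-predicate-object triples and queried by pattern matching. It uses plain Python tuples rather than an RDF library, and every identifier and predicate name (`ex:hasLigand`, `ex:hasAtom`, `ex:bondedTo`, and so on) is a hypothetical placeholder chosen for illustration, not the actual tmQM-RDF vocabulary.

```python
from typing import Optional

Triple = tuple[str, str, str]

# Toy graph. All names below are assumed for illustration only;
# they are NOT the real tmQM-RDF terms.
TRIPLES: set[Triple] = {
    # Level 1: the complex and its metal centre
    ("ex:tmc1", "rdf:type", "ex:TMC"),
    ("ex:tmc1", "ex:hasMetalCentre", "ex:tmc1_Fe"),
    ("ex:tmc1", "ex:hasLigand", "ex:ligA"),
    ("ex:tmc1", "ex:hasLigand", "ex:ligB"),
    # Level 2: the ligands
    ("ex:ligA", "rdf:type", "ex:Ligand"),
    ("ex:ligB", "rdf:type", "ex:Ligand"),
    ("ex:ligA", "ex:hasAtom", "ex:ligA_N1"),
    ("ex:ligA", "ex:hasAtom", "ex:ligA_C1"),
    ("ex:ligB", "ex:hasAtom", "ex:ligB_O1"),
    # Level 3: atoms and bonds
    ("ex:ligA_N1", "ex:element", "N"),
    ("ex:ligA_C1", "ex:element", "C"),
    ("ex:ligB_O1", "ex:element", "O"),
    ("ex:ligA_N1", "ex:bondedTo", "ex:tmc1_Fe"),
    ("ex:ligA_N1", "ex:bondedTo", "ex:ligA_C1"),
    ("ex:ligB_O1", "ex:bondedTo", "ex:tmc1_Fe"),
}


def match(triples: set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> list[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def ligands_of(triples: set[Triple], tmc: str) -> list[str]:
    """Ligands attached to a given complex."""
    return [o for _, _, o in match(triples, s=tmc, p="ex:hasLigand")]


def binding_elements(triples: set[Triple], tmc: str) -> list[str]:
    """Elements of ligand atoms bonded directly to the metal centre."""
    metals = [o for _, _, o in match(triples, s=tmc, p="ex:hasMetalCentre")]
    found = []
    for ligand in ligands_of(triples, tmc):
        for _, _, atom in match(triples, s=ligand, p="ex:hasAtom"):
            if any((atom, "ex:bondedTo", m) in triples for m in metals):
                found += [o for _, _, o in match(triples, s=atom, p="ex:element")]
    return sorted(found)
```

In this toy graph, `ligands_of(TRIPLES, "ex:tmc1")` walks from the complex to its ligands, and `binding_elements(TRIPLES, "ex:tmc1")` continues down to the atom level to find which elements coordinate the metal. In practice the same kind of traversal would be written as a SPARQL query against the actual dataset and its published vocabulary.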
