OAEI Machine Learning Dataset for Online Model Generation
Sven Hertling, Ebrahim Norouzi, Harald Sack
TL;DR
The paper tackles fair benchmarking of ML-driven ontology and knowledge graph alignment within OAEI by enabling online model adaptation through new train/validation/test splits. It introduces stratified partitions of the reference alignments ($20\%$ for training, $10\%$ for validation, and $70\%$ for testing) that are designed to preserve the distribution across entity types, relation types, and mapping difficulty, and it avoids offline, task-specific model packaging. The dataset integrates with MELT and comes with generation and evaluation code. The training data contain only positive correspondences, so hard negatives have to be generated on the fly during learning. Use-case experiments with Matcha, LogMap, and OLaLa show that online confidence-threshold tuning on the dataset can substantially improve the $F_1$-measure in some tracks, illustrating the practical benefit of adaptive thresholds. The work also lays groundwork for future in-domain transfer learning and for tracks with sparse correspondences, aiming to promote robust, fair, and reusable ML-driven OAEI benchmarking.
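The stratified $20\%$/$10\%$/$70\%$ partitioning summarized above can be illustrated with a minimal sketch. It is not the dataset's actual generation code: the tuple layout `(source, target, entity_type)`, the function name `stratified_split`, and the choice of entity type as the only stratification key are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(correspondences, key, fractions=(0.2, 0.1, 0.7), seed=42):
    """Split correspondences into train/validation/test sets per stratum.

    `key` maps a correspondence to its stratum (hypothetical: the entity type).
    Only the first two fractions are used; the test set takes the remainder.
    """
    rng = random.Random(seed)

    # Group correspondences by stratum so each group is split separately.
    strata = defaultdict(list)
    for correspondence in correspondences:
        strata[key(correspondence)].append(correspondence)

    train, validation, test = [], [], []
    for stratum in sorted(strata):
        items = strata[stratum]
        rng.shuffle(items)
        n_train = round(fractions[0] * len(items))
        n_validation = round(fractions[1] * len(items))
        train.extend(items[:n_train])
        validation.extend(items[n_train:n_train + n_validation])
        test.extend(items[n_train + n_validation:])
    return train, validation, test


# Hypothetical reference alignment: 60 class and 40 property correspondences.
reference = [(f"s{i}", f"t{i}", "class" if i < 60 else "property") for i in range(100)]
train, validation, test = stratified_split(reference, key=lambda c: c[2])
```

Splitting each stratum separately keeps the share of classes and properties roughly equal in all three sets, which is the point of stratification. Very small strata may end up with no training or validation examples because of rounding.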
Abstract
Ontology and knowledge graph matching systems are evaluated annually by the Ontology Alignment Evaluation Initiative (OAEI). More and more systems use machine learning-based approaches, including large language models. The training and validation datasets are usually determined by the system developer, and often a subset of the reference alignments is used. This sampling is against the OAEI rules and makes a fair comparison impossible. Furthermore, those models are trained offline (a trained and optimized model is packaged into the matcher), so the systems are specifically trained for those tasks. In this paper, we introduce a dataset that contains training, validation, and test sets for most of the OAEI tracks. This makes online model learning possible (the systems must adapt to the given input alignment without human intervention) and enables a fair comparison of ML-based systems. We showcase the usefulness of the dataset by fine-tuning the confidence thresholds of popular systems.
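Confidence-threshold tuning of the kind mentioned above can be sketched as a simple search over candidate thresholds that maximizes $F_1$ on the validation set. This is a minimal illustration, not the procedure used in the paper or by any of the evaluated systems: the names `f1_score` and `tune_threshold` are hypothetical, and the sketch assumes that the scored candidates have already been restricted to entities covered by the validation alignment, so that correspondences outside it are not counted as false positives.

```python
def f1_score(predicted, reference):
    """F1 of a predicted set of correspondences against a reference set."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scored, validation_reference):
    """Return the (threshold, F1) pair with the best validation F1.

    `scored` maps a (source, target) pair to the matcher's confidence.
    Every distinct confidence value is tried as a threshold; ties are
    resolved in favour of the lowest threshold.
    """
    best_threshold, best_f1 = 0.0, -1.0
    for threshold in sorted(set(scored.values())):
        predicted = {pair for pair, conf in scored.items() if conf >= threshold}
        f1 = f1_score(predicted, validation_reference)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical matcher output and validation alignment.
scored = {("a", "x"): 0.9, ("b", "y"): 0.8, ("c", "z"): 0.3, ("d", "w"): 0.2}
validation_reference = {("a", "x"), ("b", "y")}
threshold, f1 = tune_threshold(scored, validation_reference)
```

The selected threshold would then be applied unchanged to the system's output on the test set, so the test correspondences never influence the tuning.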
