Kastor: Fine-tuned Small Language Models for Shape-based Active Relation Extraction
Ringwald Celian, Gandon Fabien, Faron Catherine, Michel Franck, Abi Akl Hanna
TL;DR
Kastor presents a shape-based RDF relation extraction framework that fine-tunes small language models (SLMs) on noisy, domain-specific knowledge bases (KBs) by leveraging SHACL patterns. It relaxes the traditional maximal-shape constraint to use example-specific patterns and applies rule-based graph augmentation and knowledge distillation to build a refined training base. An iterative, lightweight active learning loop with a domain expert generates gold data, enabling the SLM to generalize to unseen patterns and improve KB completion. The approach demonstrates strong per-sample pattern diversity, affordable training cost, and improved factual coverage, suggesting practical impact for scalable, domain-focused KB curation.
Abstract
RDF pattern-based extraction is a compelling approach for fine-tuning small language models (SLMs) by focusing a relation extraction task on a specified SHACL shape. This technique enables the development of efficient models trained on limited text and RDF data. In this article, we introduce Kastor, a framework that advances this approach to meet the demands of completing and refining knowledge bases in specialized domains. Kastor reformulates the traditional validation task, shifting from validation against a single SHACL shape to evaluating all possible combinations of properties derived from the shape. By selecting the optimal combination for each training example, the framework significantly enhances model generalization and performance. Additionally, Kastor employs an iterative learning process to refine noisy knowledge bases, enabling the creation of robust models capable of uncovering new, relevant facts.
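
To make the reformulated task concrete, the following is a minimal sketch of enumerating property combinations from a shape and picking one per training example. It is not the Kastor implementation: the function names, the example data, and above all the selection criterion (keep the largest combination whose values are literally mentioned in the text) are assumptions made for illustration, since the abstract does not specify how the optimal combination is chosen.

```python
from itertools import combinations


def property_combinations(shape_properties):
    """Yield every non-empty combination of the shape's properties."""
    props = sorted(shape_properties)
    for size in range(1, len(props) + 1):
        for combo in combinations(props, size):
            yield frozenset(combo)


def select_pattern(shape_properties, example_triples, text):
    """Return the largest property combination supported by the text.

    ASSUMPTION: a property counts as "supported" when its value appears
    verbatim in the text. This is a hypothetical stand-in criterion.
    """
    lowered = text.lower()
    supported = {
        prop
        for prop, value in example_triples
        if prop in shape_properties and value.lower() in lowered
    }
    best = frozenset()
    for combo in property_combinations(shape_properties):
        if combo <= supported and len(combo) > len(best):
            best = combo
    return best


# Hypothetical shape and training example (not taken from the paper).
shape = {"schema:name", "schema:birthPlace", "schema:birthDate"}
triples = [
    ("schema:name", "Ada Lovelace"),
    ("schema:birthPlace", "London"),
    ("schema:birthDate", "1815-12-10"),
]
text = "Ada Lovelace was born in London."

pattern = select_pattern(shape, triples, text)
```

Here the birth date is in the graph but not in the text, so under the assumed criterion the example is paired with the two-property pattern rather than with the full shape. Exhaustive enumeration grows as 2^n - 1 in the number of properties, which is only practical for small shapes.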
