Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties

Célian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl

TL;DR

This work extends SHACL-guided RDF extraction from datatype properties to object properties and tackles the long-tail problem by testing data scaling, stratification, weighted loss, and data augmentation. A distillation-based dual-graph base and a distilled dataset are used to train small language models, which perform competitively with, or surpass, REBEL while maintaining structural correctness. Key finding: ensuring a minimal property-level exposure (approximately 1,000 examples per property) is crucial for balanced generalization across both frequent and rare properties; scaling improves micro-F1 but not macro-F1 without sufficient exposure. The study provides practical guidelines and releases datasets and code to support reproducible, shape-aware semantic relation extraction.
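The gap between micro-F1 and macro-F1 mentioned above can be illustrated with a minimal sketch. The property names and counts below are hypothetical and are not taken from the paper; the functions only implement the standard definitions of the two averages. Micro-F1 pools all triples, so frequent properties dominate it, while macro-F1 averages per-property scores, so a poorly learned rare property pulls it down.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one property
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_property: Dict[str, Counts]) -> float:
    """Pool the counts over all properties, then compute a single F1."""
    tp = sum(c[0] for c in per_property.values())
    fp = sum(c[1] for c in per_property.values())
    fn = sum(c[2] for c in per_property.values())
    return f1(tp, fp, fn)


def macro_f1(per_property: Dict[str, Counts]) -> float:
    """Compute F1 per property, then take the unweighted mean."""
    scores = [f1(*c) for c in per_property.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical counts: one frequent, well-learned property and one rare one.
counts = {
    "birthDate": (900, 50, 50),
    "doctoralAdvisor": (2, 3, 8),
}

print(f"micro-F1: {micro_f1(counts):.3f}")
print(f"macro-F1: {macro_f1(counts):.3f}")
```

With counts like these, the pooled score stays high while the per-property average drops sharply, which is the pattern the key finding describes for scaling without sufficient exposure to rare properties.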

Abstract

Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To address this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy to perform equally well over unbalanced target properties is to build a training set in which the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we have publicly released our datasets, experimental results, and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
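The threshold idea in the abstract can be sketched as follows. This is an illustrative assumption about how such a training set might be assembled, not the authors' procedure: each example is reduced to the list of properties it contains, and the helper names `property_counts`, `underexposed`, and `select_until_covered` are hypothetical. The selection is a simple greedy pass that keeps an example only while at least one of its properties is still below the threshold.

```python
from collections import Counter
from typing import Dict, List, Sequence


def property_counts(examples: Sequence[Sequence[str]]) -> Counter:
    """Count how many times each property occurs across the examples."""
    counts: Counter = Counter()
    for props in examples:
        counts.update(props)
    return counts


def underexposed(examples: Sequence[Sequence[str]], threshold: int) -> Dict[str, int]:
    """Return properties seen at least once but fewer than `threshold` times."""
    counts = property_counts(examples)
    return {p: n for p, n in counts.items() if n < threshold}


def select_until_covered(pool: Sequence[Sequence[str]], threshold: int) -> List[int]:
    """Greedy pass over the pool: keep an example if it contains any property
    whose running count is still below `threshold`. Returns kept indices."""
    counts: Counter = Counter()
    selected: List[int] = []
    for idx, props in enumerate(pool):
        if any(counts[p] < threshold for p in props):
            selected.append(idx)
            counts.update(props)
    return selected


# Toy pool: "a" is frequent, "b" is rare (hypothetical property names).
pool = [["a"], ["a"], ["a", "b"], ["a"], ["b"]]
kept = select_until_covered(pool, threshold=2)
train = [pool[i] for i in kept]
print(kept, dict(property_counts(train)), underexposed(train, threshold=2))
```

Frequent properties may end up above the threshold because they co-occur with rare ones; the sketch only guarantees a floor, and only for properties that occur often enough in the pool to reach it.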

Paper Structure

This paper contains 30 sections, 11 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Averaged micro F1 by property and model
  • Figure 2: Averaged micro F1 by property and model