LLM-based Zero-shot Triple Extraction for Automated Ontology Generation from Software Engineering Standards
Songhui Yue
TL;DR
This work addresses automated ontology generation from Software Engineering Standards (SES) by focusing on relation triple extraction (RTE) within an open-source, LLM-assisted workflow. It proposes an assertion-led ABox–TBox co-extraction approach that builds a reusable ontology scaffold $G=(V,E)$ from SES text, integrating constrained LLM prompts with postprocessing and normalization. The study constructs three expert-annotated reference sets (Ref-Short, Ref-Medium, Ref-Long) to evaluate robustness across granularities and demonstrates that a 7B open-source LLM can be competitive with the Stanford OpenIE baseline, especially in precision. The contributions include a practical, end-to-end workflow for SES AOG, a scaffold-building methodology, and preliminary benchmarks, with future work aiming to extend to the full SES, improve recall via cross-sentence techniques, and release an OWL 2 ontology alongside public benchmarks for broader use.
Abstract
Ontologies have supported knowledge representation and white-box reasoning for decades; automated ontology generation (AOG) therefore plays a crucial role in scaling their use. Software engineering standards (SES) consist of long, unstructured, noisy text whose paragraphs are dense with domain-specific terms. In this setting, relation triple extraction (RTE), together with term extraction, constitutes the first stage toward AOG. This work proposes an open-source large language model (LLM)-assisted approach to RTE for SES. Instead of relying solely on prompt-engineering-based methods, this study promotes the use of LLMs as an aid in constructing ontologies and explores an effective AOG workflow that includes document segmentation, candidate term mining, LLM-based relation inference, term normalization, and cross-section alignment. Expert-annotated reference sets at three granularities are constructed and used to evaluate the ontology generated in the study. The results show that the proposed approach is comparable, and potentially superior, to the OpenIE method of triple extraction.
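To make the evaluation setting concrete, the following is a minimal sketch of how extracted triples might be normalized, assembled into a scaffold $G=(V,E)$, and scored against a reference set. It is illustrative only: the `Triple` type, the lowercase-and-collapse-whitespace normalization, and exact-match scoring are assumptions made for this example and are not taken from the paper, and the sample triples are invented rather than drawn from any SES.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    """A (subject, relation, object) relation triple. Hypothetical type."""
    subject: str
    relation: str
    obj: str


def normalize(term: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(term.lower().split())


def normalize_triple(t: Triple) -> Triple:
    return Triple(normalize(t.subject), normalize(t.relation), normalize(t.obj))


def build_scaffold(triples: Iterable[Triple]) -> Tuple[Set[str], Set[Triple]]:
    """Build G = (V, E): V holds the terms, E holds the normalized triples."""
    edges = {normalize_triple(t) for t in triples}
    vertices = {e.subject for e in edges} | {e.obj for e in edges}
    return vertices, edges


def score(predicted: Iterable[Triple],
          reference: Iterable[Triple]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 under exact match of normalized triples."""
    pred = {normalize_triple(t) for t in predicted}
    ref = {normalize_triple(t) for t in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented sample data, for illustration only.
predicted = [
    Triple("Software  Requirement", "is a", "Requirement"),
    Triple("Test Plan", "describes", "Test Activity"),
]
reference = [
    Triple("software requirement", "is a", "requirement"),
    Triple("test case", "verifies", "software requirement"),
]

vertices, edges = build_scaffold(predicted)
precision, recall, f1 = score(predicted, reference)
```

Exact matching is the strictest choice and tends to understate recall when the extractor and the annotators phrase a relation differently; a softer matching criterion would change the numbers without changing the overall structure of the sketch.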
