Bench4KE: Benchmarking Automated Competency Question Generation

Anna Sofia Lippolis, Minh Davide Ragagni, Paolo Ciancarini, Andrea Giovanni Nuzzolese, Valentina Presutti

TL;DR

Bench4KE tackles the lack of standardized evaluation for automated Competency Question (CQ) generation in Knowledge Engineering by delivering an extensible API-based benchmarking system. It provides a gold-standard dataset of 843 CQs from 17 real-world ontology projects and a multi-metric validation pipeline that combines lexical similarity measures with an LLM-based semantic judge. The paper reports a baseline comparison of six recent CQ-generation systems and demonstrates Bench4KE’s extensibility to additional KE tasks and input modalities. This work aims to improve reproducibility and cross-domain comparability in KE automation, fostering community-driven development and evaluation.

Abstract

The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation. This trend is already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs), natural language questions used by ontology engineers to define the functional requirements of an ontology. However, the evaluation of these tools lacks standardization. This undermines the methodological rigor and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. The presented release focuses on evaluating tools that generate CQs automatically. Bench4KE provides a curated gold standard consisting of CQ datasets from 17 real-world ontology engineering projects and uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of 6 recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.
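As a rough illustration of what scoring generated CQs against a gold standard with a lexical similarity metric can look like, the sketch below uses token-level Jaccard similarity and a simple hit rate. It is not Bench4KE's implementation: the function names, the choice of Jaccard as the metric, the 0.5 threshold, and the example questions are all assumptions made for illustration, and the abstract does not specify which metrics the suite contains.

```python
import re


def tokens(question: str) -> set[str]:
    """Lowercase a question and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two questions (assumed metric)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def hit_rate(generated: list[str], gold: list[str], threshold: float = 0.5) -> float:
    """Fraction of gold CQs matched by at least one generated CQ.

    A gold CQ counts as a hit when some generated CQ reaches the
    (hypothetical) similarity threshold.
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for g in gold
        if any(jaccard(g, c) >= threshold for c in generated)
    )
    return hits / len(gold)


# Made-up example data, not taken from the Bench4KE gold standard.
gold = [
    "Which sensors are installed in a building?",
    "Who is the author of a document?",
]
generated = [
    "Which sensors are installed in the building?",
    "What is the price of a ticket?",
]

score = hit_rate(generated, gold)
print(f"hit rate: {score:.2f}")
```

A purely lexical score like this one misses paraphrases that share few tokens, which is one plausible reason to pair such measures with a semantic judge, as the TL;DR describes.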

Paper Structure

This paper contains 29 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Bench4KE's workflow for CQ validation.
  • Figure 2: Bench4KE's system architecture.
  • Figure 3: Mean Hit Rate across systems according to the usage scenarios.