Human Evaluation of Procedural Knowledge Graph Extraction from Text with Large Language Models

Valentina Anita Carriero, Antonia Azzini, Ilaria Baroni, Mario Scrocca, Irene Celino

TL;DR

This study explores turning unformatted textual procedures into a structured Procedural Knowledge Graph by leveraging prompt-engineered LLMs and a predefined ontology. It uses WikiHow-derived data and a two-stage prompting pipeline to produce RDF Turtle representations, then conducts a large-scale human evaluation to assess perceived quality, usefulness, and potential biases in AI-based annotation. Findings indicate that LLM outputs are generally rated as high-quality but only moderately useful, with slight evidence of bias against AI; the results support a pragmatic, human-in-the-loop approach for deploying Procedural KG extraction in real-world settings. The work advances procedural knowledge extraction by combining ontology-guided KG construction with rigorous human-centric evaluation and sets directions for more robust, multi-format, and retrieval-augmented extensions.

Abstract

Procedural Knowledge is the know-how expressed in the form of sequences of steps needed to perform some tasks. Procedures are usually described by means of natural language texts, such as recipes or maintenance manuals, possibly spread across different documents and systems, and their interpretation and subsequent execution is often left to the reader. Representing such procedures in a Knowledge Graph (KG) can be the basis to build digital tools to support those users who need to apply or execute them. In this paper, we leverage Large Language Model (LLM) capabilities and propose a prompt engineering approach to extract steps, actions, objects, equipment and temporal information from a textual procedure, in order to populate a Procedural KG according to a pre-defined ontology. We evaluate the KG extraction results by means of a user study, in order to qualitatively and quantitatively assess the perceived quality and usefulness of the LLM-extracted procedural knowledge. We show that LLMs can produce outputs of acceptable quality and we assess the subjective perception of AI by human evaluators.

Paper Structure

This paper contains 12 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Ontology used in the experiments.
  • Figure 2: Example procedure and output provided in prompt P1.
  • Figure 3: Illustration of the Chain of Prompt for Procedural KG extraction, along with an example of extracted step.
  • Figure 4: Distribution of human ratings on the evaluation items (cf. Table \ref{tab:llm-eval}).