Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Kathleen P. Mealey, Jonathan A. Karr, Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman

TL;DR

The paper tackles the challenge of deriving operational and maintenance intelligence from domain-specific, largely unstructured aviation data. It introduces OMIn, a publicly available FAA-derived benchmark with gold standards for NER, CR, and NEL, and performs a zero-shot evaluation of sixteen open-source KE tools across NER, CR, NEL, and RE in an on-premises, confidential setting. Findings reveal significant domain-transfer gaps: NER and NEL show low recall, CR transfers more robustly, and RE suffers from heterogeneous ontologies and tool-specific relation schemas, leading to limited immediate deployment readiness. The work highlights the need for domain-adapted resources, a shared maintenance ontology for fair RE benchmarking, and open datasets to accelerate trusted KE for safety-critical industries, while providing a reproducible baseline for future research and domain adaptation efforts.

Abstract

Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

Paper Structure

This paper contains 57 sections, 2 equations, 5 figures, 23 tables.

Figures (5)

  • Figure 1: KE Workflow. The Knowledge Extraction Workflow is an approach to extracting graphical data from unstructured text. It begins with CR, which identifies different words or phrases that refer to the same entity. Then, in NEL, entities are recognized (NER) and linked to corresponding unique IDs in an external KB. Lastly, in RE, entities are recognized (NER) and connected through well-defined relationships.
  • Figure 2: Conceptual Overview of Methodology and Evaluation Strategy Used in This Study. To create the OMIn dataset, we proceed through data selection, pre-processing, and Gold Standard (GS) development. Then, we select four tools for each stage of the KE workflow. The resulting sixteen tools are run on the OMIn dataset, each producing a set of results in the form of named entities (NER), coreferences (CR), linked entities (NEL), or relational triples (RE). These results are then evaluated against their respective GSs or, in the case of RE, against qualitative standards for knowledge representation in KGs. The experimental setup and the development and implementation of the evaluation metrics are each discussed in a dedicated subsection.
  • Figure 3: Sample KE Tasks Implementation. Here, the KE Tasks are applied to the sentence on the bottom from OMIn to generate the entities and coreferences annotated on the sentence itself as well as the relational triples and links represented in the graph on the top. Named entities, like aircraft and Pilot, are denoted in blue, and may be understood as either direct results from an NER implementation or intermediate results from an NER subtask via NEL and RE. Then, the CR system (purple) recognizes that the aircraft refers to the same entity in different parts of the sentence, ensuring information relating to the aircraft is consolidated around one node. NEL (green) connects recognized entities to their corresponding Wikidata entries, such as aircraft (Q11436) and aircraft pilot (Q2095549). Finally, RE (red) identifies relationships between entities, with the red edges representing Wikidata properties, such as the pilot operating the aircraft or the brake being a part of the aircraft.
  • Figure 4: Distribution of Document Lengths in OMIn. The OMIn Dataset features 2748 short documents, usually 1-3 incomplete sentences, which are drawn from accident/incident reports captured in AID. The documents range between 2 and 25 words. The mean is 17.23, and the standard deviation is 3.14. The Q1 is 16; the median is 18; and Q3 is 19.
  • Figure 5: OMIn Dataset Curation. The OMIn Dataset is a subset of maintenance-related incidents from the FAA Accident/Incident Dataset. A random selection of 100 records from OMIn was chosen as the basis for the task-specific gold standards. The remaining 2648 documents in OMIn are unlabeled.
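
The way the four KE tasks in the Figure 3 caption combine into one graph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the mention IDs (such as `aircraft#1`), the relation labels `operates` and `part of`, and the helper names `canonical_map` and `build_graph` are hypothetical. Only the two Wikidata IDs (Q11436 for aircraft, Q2095549 for aircraft pilot) come from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str   # mention ID of the subject (hypothetical naming scheme)
    relation: str  # illustrative label, not a real Wikidata property ID
    obj: str       # mention ID of the object


def canonical_map(coref_clusters):
    """Map every mention in a cluster to the cluster's first mention."""
    return {m: cluster[0] for cluster in coref_clusters for m in cluster}


def build_graph(triples, coref_clusters, links):
    """Merge coreferent mentions, then attach KB links and relation edges."""
    canon = canonical_map(coref_clusters)
    nodes, edges = {}, set()
    for t in triples:
        s = canon.get(t.subject, t.subject)
        o = canon.get(t.obj, t.obj)
        for n in (s, o):
            nodes[n] = links.get(n)  # None when NEL produced no KB entry
        edges.add((s, t.relation, o))
    return nodes, edges


# Hypothetical tool outputs for a single sentence.
coref_clusters = [["aircraft#1", "aircraft#2"]]           # CR
links = {"aircraft#1": "Q11436", "pilot#1": "Q2095549"}   # NEL
triples = [                                               # RE
    Triple("pilot#1", "operates", "aircraft#1"),
    Triple("brake#1", "part of", "aircraft#2"),
]

nodes, edges = build_graph(triples, coref_clusters, links)
for s, r, o in sorted(edges):
    print(f"{s} --{r}--> {o}")
```

Because CR is applied before the edges are added, both triples attach to the single `aircraft#1` node, which mirrors the caption's point that information about the aircraft is consolidated around one node. A mention with no KB entry, such as `brake#1` here, keeps a `None` link rather than being dropped.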