Trusted Knowledge Extraction for Operations and Maintenance Intelligence
Kathleen P. Mealey, Jonathan A. Karr, Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman
TL;DR
The paper tackles the challenge of deriving operational and maintenance intelligence from domain-specific, largely unstructured aviation data. It introduces OMIn, a publicly available FAA-derived benchmark with gold standards for Named Entity Recognition (NER), Coreference Resolution (CR), and Named Entity Linking (NEL), and performs a zero-shot evaluation of sixteen open-source Knowledge Extraction (KE) tools across NER, CR, NEL, and Relation Extraction (RE) in an on-premises, confidential setting. Findings reveal significant domain-transfer gaps: NER and NEL show low recall, CR transfers more robustly, and RE suffers from heterogeneous ontologies and tool-specific relation schemas, so the tools are not yet ready for immediate deployment. The work highlights the need for domain-adapted resources, a shared maintenance ontology for fair RE benchmarking, and open datasets to accelerate trusted KE for safety-critical industries, and it provides a reproducible baseline for future research and domain adaptation efforts.
Abstract
Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality versus data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public-domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.
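To make concrete what scoring a tool against a gold standard involves, the following is a minimal sketch of exact-match precision, recall, and F1 over entity spans. It is an illustration only, not the paper's evaluation code: the `Entity` type, the `score_entities` helper, the sample sentence, and the labels `PART`, `PHASE`, `PERSON`, and `LOC` are all hypothetical and do not reflect the dataset's actual records or label schema.

```python
from typing import Iterable, NamedTuple, Tuple


class Entity(NamedTuple):
    """A labeled character span (hypothetical representation)."""
    start: int
    end: int
    label: str


def score_entities(
    gold: Iterable[Entity], predicted: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Return exact-match (precision, recall, F1) for predicted spans.

    A prediction counts as correct only if its start, end, and label
    all match a gold entity exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical record, invented for illustration:
# "ENGINE LOST POWER ON TAKEOFF. PILOT LANDED IN FIELD."
gold = [
    Entity(0, 6, "PART"),      # ENGINE
    Entity(21, 28, "PHASE"),   # TAKEOFF
    Entity(30, 35, "PERSON"),  # PILOT
]
predicted = [
    Entity(0, 6, "PART"),      # correct
    Entity(43, 45, "LOC"),     # spurious span over "IN"
]

precision, recall, f1 = score_entities(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the tool finds one of three gold entities, so recall is lower than precision, which is the kind of gap the low-recall finding refers to. Exact matching is the strictest choice; a partial-overlap criterion would be a reasonable alternative and would give higher scores.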
