KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance

Kuangshi Ai, Jonathan A. Karr, Meng Jiang, Nitesh V. Chawla, Chaoli Wang

TL;DR

This work tackles the challenge of applying large language models in safety-critical aviation maintenance, where knowledge gaps and hallucinations hinder reliability. It introduces KEO, a KG-augmented RAG framework that builds a structured knowledge graph from the OMIn dataset and integrates it into a retrieval pipeline for coherent, dataset-wide reasoning. The graph is formalized as $G=(V,E)$; retrieval starts from seed nodes $V_k$, expands them into $m$-hop subgraphs $G^{(m)}$, and refines the resulting structure into maximum spanning trees. The methodology has three parts: KG creation, KG-based RAG, and automatic QA benchmark construction. Evaluation is LLM-based, using offline, locally deployable models (e.g., Gemma-3, Phi-4, Mistral-Nemo) together with stronger judges (GPT-4o, Llama-3.3) to assess global sensemaking and knowledge-to-action performance. Empirical results show that KG-RAG substantially improves dataset-wide sensemaking and pattern discovery, while traditional text-chunk RAG remains competitive for fine-grained procedural tasks. This points to a complementary role for structured context in high-stakes QA and to the importance of secure local deployment. The findings underscore KG-augmented LLMs as a promising approach for safe, domain-specific QA in aviation maintenance and potentially other safety-critical domains, with future work on domain adaptation, scaling, and multimodal data integration.
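To make the graph formalism concrete, the following is a minimal sketch of expanding seed nodes by $m$ hops and reducing the induced subgraph to a maximum spanning forest. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy edges, and the edge weights are hypothetical, the expansion is a plain breadth-first search rather than the importance-aware expansion the paper describes, and Kruskal's algorithm is swapped in as one standard way to obtain a maximum spanning tree.

```python
from collections import defaultdict


def m_hop_expand(edges, seeds, m):
    """Return all nodes within m hops of the seed nodes (undirected graph)."""
    adj = defaultdict(set)
    for u, v, _weight in edges:
        adj[u].add(v)
        adj[v].add(u)
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(m):
        # Neighbours of the current frontier that have not been seen yet.
        frontier = {n for f in frontier for n in adj[f]} - visited
        visited |= frontier
    return visited


def maximum_spanning_forest(edges, nodes):
    """Kruskal's algorithm over edges sorted by descending weight.

    Only edges with both endpoints in `nodes` are considered.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        if u in parent and v in parent:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:  # edge joins two components, so keep it
                parent[root_u] = root_v
                kept.append((u, v, w))
    return kept


# Hypothetical toy graph; the weights stand in for some edge-importance score.
EDGES = [
    ("engine", "cylinder", 0.9),
    ("cylinder", "oil leak", 0.8),
    ("engine", "oil leak", 0.3),
    ("oil leak", "gasket", 0.7),
    ("gasket", "replace gasket", 0.6),
    ("replace gasket", "inspection", 0.5),
]

subgraph_nodes = m_hop_expand(EDGES, {"engine"}, m=2)
tree_edges = maximum_spanning_forest(EDGES, subgraph_nodes)
```

In this toy case the 2-hop neighbourhood of `engine` contains four nodes, and the spanning step drops the weakest edge that would close a cycle, which leaves a compact tree that could be serialized as context for the LLM.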

Abstract

We present Knowledge Extraction on OMIn (KEO), a domain-specific knowledge extraction and reasoning framework with large language models (LLMs) in safety-critical contexts. Using the Operations and Maintenance Intelligence (OMIn) dataset, we construct a QA benchmark spanning global sensemaking and actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and integrates it into a retrieval-augmented generation (RAG) pipeline, enabling more coherent, dataset-wide reasoning than traditional text-chunk RAG. We evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks requiring localized retrieval. These findings underscore the promise of KG-augmented LLMs for secure, domain-specific QA and their potential in high-stakes reasoning.

Paper Structure

This paper contains 38 sections, 6 equations, 3 figures, 13 tables.

Figures (3)

  • Figure 1: Overview of the KEO pipeline. Aviation maintenance records and problem–action pairs are first transformed into a QA benchmark covering both global sensemaking and knowledge-to-action questions. In parallel, KEO constructs a structured KG from raw maintenance data. A KG-based RAG workflow then leverages semantic node identification, importance-aware graph expansion, and structured context reconstruction to enhance LLM responses on these safety-critical questions. Finally, an LLM judge evaluates answers through both absolute and comparative scoring with carefully designed metrics.
  • Figure 2: Head-to-head win rate matrix of row method over column method (TC: text-chunk RAG, VN: vanilla LLM, KG: our method KEO) for 83 global sensemaking questions, evaluated by GPT-4o. The KG used is generated with GPT-4o from 100 records. Win rates are reported across five dimensions and overall. Green cells indicate a win; red cells indicate a loss. The proposed KEO method consistently outperforms text-chunk RAG when paired with stronger LLMs, but its performance may degrade with weaker backbone models.
  • Figure 3: Head-to-head win rate matrix of row method over column method (TC: text-chunk RAG, VN: vanilla LLM, KG: our method KEO) on the same 83 global sensemaking questions, evaluated by Llama-3.3-70B-Instruct. Compared to GPT-4o evaluation (Figure 2), Llama shows a higher preference for answers generated using RAG-based methods. Nonetheless, the same trend persists: stronger LLMs tend to amplify the advantage of the KEO approach.
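
The head-to-head win rate matrices in Figures 2 and 3 can be illustrated with a small sketch that aggregates pairwise judge verdicts into row-over-column win rates. This is an assumption-laden illustration rather than the paper's evaluation code: the function name, the input format, and the choice to count a tie as half a win for each side are all hypothetical, and the verdicts below are made-up toy data, not results from the paper.

```python
from collections import defaultdict


def win_rate_matrix(judgments, methods):
    """Aggregate pairwise verdicts into row-over-column win rates.

    `judgments` is a list of (method_a, method_b, winner) tuples, where
    `winner` is method_a, method_b, or None for a tie. A tie is counted
    as half a win for each side (an assumption made for this sketch).
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for a, b, winner in judgments:
        totals[(a, b)] += 1
        totals[(b, a)] += 1
        if winner == a:
            wins[(a, b)] += 1.0
        elif winner == b:
            wins[(b, a)] += 1.0
        else:
            wins[(a, b)] += 0.5
            wins[(b, a)] += 0.5
    matrix = {}
    for row in methods:
        matrix[row] = {}
        for col in methods:
            if col == row:
                continue
            n = totals[(row, col)]
            # None marks a pair that was never compared.
            matrix[row][col] = wins[(row, col)] / n if n else None
    return matrix


# Made-up verdicts for illustration only (KG: KEO, TC: text-chunk RAG, VN: vanilla LLM).
toy_judgments = [
    ("KG", "TC", "KG"),
    ("KG", "TC", "KG"),
    ("KG", "TC", "TC"),
    ("KG", "TC", None),
    ("KG", "VN", "KG"),
    ("TC", "VN", "VN"),
]
toy_matrix = win_rate_matrix(toy_judgments, ["KG", "TC", "VN"])
```

Under this convention the two entries for any compared pair sum to 1, so a cell above 0.5 corresponds to a win for the row method and a cell below 0.5 to a loss.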