KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance
Kuangshi Ai, Jonathan A. Karr, Meng Jiang, Nitesh V. Chawla, Chaoli Wang
TL;DR
This work tackles the challenge of applying large language models in safety-critical aviation maintenance, where knowledge gaps and hallucinations hinder reliability. It introduces KEO, a KG-augmented RAG framework that builds a structured knowledge graph from the OMIn dataset and integrates it into a retrieval pipeline for coherent, dataset-wide reasoning. The graph is formalized as $G=(V,E)$, with seed nodes $V_k$ and $m$-hop expansions $G^{(m)}$ whose structure is refined into maximum spanning trees. The methodology has three parts: KG creation, KG-based RAG, and automatic QA benchmark construction. It is complemented by LLM-based evaluation in which offline, locally deployable models (e.g., Gemma-3, Phi-4, Mistral-Nemo) are assessed on global sensemaking and knowledge-to-action performance, with stronger models (GPT-4o, Llama-3.3) acting as judges. Empirical results show that KG-RAG substantially improves dataset-wide sensemaking and pattern discovery, while traditional text-chunk RAG remains competitive for fine-grained procedural tasks. This highlights a complementary role for structured context in high-stakes QA and the importance of secure local deployment. The findings underscore KG-augmented LLMs as a promising approach for safe, domain-specific QA in aviation maintenance and potentially other safety-critical domains. Future work includes domain adaptation, scaling, and multimodal data integration.
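The graph formalization above (seed nodes $V_k$, an $m$-hop expansion $G^{(m)}$, and refinement into a maximum spanning tree) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `m_hop_subgraph` and `maximum_spanning_forest`, the toy entities, and the numeric edge weights are all hypothetical. The sketch also assumes that edges are undirected and carry a relevance weight, which the summary does not specify.

```python
from collections import deque


def m_hop_subgraph(edges, seeds, m):
    """Return the nodes and edges within m hops of any seed node."""
    # Build an undirected adjacency list from (u, v, weight) triples.
    adj = {}
    for u, v, _w in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    # Multi-source breadth-first search that stops expanding at depth m.
    depth = {s: 0 for s in seeds}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == m:
            continue
        for nbr in adj.get(node, []):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)

    nodes = set(depth)
    sub_edges = [(u, v, w) for u, v, w in edges if u in nodes and v in nodes]
    return nodes, sub_edges


def maximum_spanning_forest(nodes, edges):
    """Kruskal's algorithm over edges sorted by descending weight."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # keep the edge only if it joins two components
            parent[root_u] = root_v
            kept.append((u, v, w))
    return kept


# Hypothetical toy graph: (entity, entity, assumed relevance weight).
toy_edges = [
    ("engine", "oil leak", 0.9),
    ("oil leak", "gasket", 0.8),
    ("engine", "gasket", 0.3),
    ("gasket", "replace seal", 0.7),
    ("replace seal", "inspection", 0.5),
]

nodes, sub_edges = m_hop_subgraph(toy_edges, seeds={"engine"}, m=2)
tree = maximum_spanning_forest(nodes, sub_edges)
```

In this toy run the 2-hop expansion from the seed `"engine"` excludes `"inspection"`, which lies three hops away, and the spanning step drops the weakest edge of the only cycle. The result is a tree-shaped context that could be passed to a generator in place of raw text chunks.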
Abstract
We present Knowledge Extraction on OMIn (KEO), a domain-specific framework for knowledge extraction and reasoning with large language models (LLMs) in safety-critical contexts. Using the Operations and Maintenance Intelligence (OMIn) dataset, we construct a QA benchmark spanning global sensemaking and actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and integrates it into a retrieval-augmented generation (RAG) pipeline, enabling more coherent, dataset-wide reasoning than traditional text-chunk RAG. We evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks that require localized retrieval. These findings underscore the promise of KG-augmented LLMs for secure, domain-specific QA and their potential in high-stakes reasoning.
