Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing

Maoqi Liu, Quan Fang, Yang Yang, Can Zhao, Kaiquan Cai

TL;DR

This work reframes NOTAM interpretation as a semantic parsing task that requires domain knowledge, and introduces Knots, a large expert-annotated dataset of 12,347 records across 194 FIRs. It proposes a two-stage, data-centric and model-centric framework (MDA-HDF) that first discovers candidate information fields and then refines them through structured debate to balance recall and precision. Prompting strategies and domain-specific optimizations are systematically evaluated across multiple LLMs; the evaluation finds that 5-shot in-context learning with deterministic generation yields the best performance, and that a multi-agent setup significantly improves field discovery and parsing accuracy. The findings offer practical guidelines for automated NOTAM analysis, including dataset-driven improvements, robust prompting strategies, and cautious use of advanced reasoning to ensure safety-critical reliability in aviation applications.
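As a rough illustration of the 5-shot in-context learning setup mentioned above, the sketch below assembles a prompt from five demonstrations followed by the query NOTAM. Everything in it is an assumption for illustration: the demonstration texts, the field names, the instruction wording, and the helper `build_few_shot_prompt` are hypothetical and are not taken from the Knots dataset or the paper's prompts. Deterministic generation is approximated here by a temperature of 0, which is a common convention rather than the authors' confirmed setting.

```python
import json

# Hypothetical demonstrations: NOTAM-like texts and field names invented for illustration.
DEMONSTRATIONS = [
    ("RWY 09/27 CLSD DUE TO MAINT", {"subject": "runway", "status": "closed"}),
    ("TWY B CLSD BTN TWY A AND TWY C", {"subject": "taxiway", "status": "closed"}),
    ("ILS RWY 36 U/S", {"subject": "ils", "status": "unserviceable"}),
    ("APRON 2 WIP", {"subject": "apron", "status": "work in progress"}),
    ("VOR ABC U/S", {"subject": "vor", "status": "unserviceable"}),
]

# Assumed decoding settings: temperature 0 as a stand-in for deterministic generation.
GENERATION_SETTINGS = {"temperature": 0.0}


def build_few_shot_prompt(demos, query, k=5):
    """Build a k-shot prompt: instruction, k worked examples, then the query."""
    if len(demos) < k:
        raise ValueError(f"need at least {k} demonstrations, got {len(demos)}")
    parts = ["Parse the message into a JSON object of fields."]
    for text, fields in demos[:k]:
        # sort_keys keeps the serialized examples stable across runs.
        parts.append(f"NOTAM: {text}\nOutput: {json.dumps(fields, sort_keys=True)}")
    # The query ends with an open "Output:" slot for the model to complete.
    parts.append(f"NOTAM: {query}\nOutput:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(DEMONSTRATIONS, "ILS RWY 18 U/S")
```

The resulting `prompt` string and `GENERATION_SETTINGS` would be passed to whichever LLM client is in use; no specific client API is assumed here.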

Abstract

Notices to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, and lacks deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.

Paper Structure

This paper contains 37 sections, 12 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: An illustrative comparison of different paradigms for NOTAM information extraction. The tip of the iceberg represents traditional methods like regex-based rules and NER, which only scratch the surface by extracting explicitly stated keywords. In contrast, the submerged part visualizes the depth required by the "NOTAM parsing" task. This deeper analysis involves semantic understanding and inference to produce a highly structured, hierarchical, and application-ready output, a challenge well-suited for modern LLMs.
  • Figure 2: Category and subcategory distribution of Q-codes within the NOTAM dataset.
  • Figure 3: Overview of the multi-agent field discovery and refinement framework. The pipeline consists of two main stages: (1) Multi-Agent Field Discovery (MDA) for systematic field extraction, and (2) Hybrid Debate Framework (HDF) for collaborative refinement through structured debate and deterministic consolidation.
  • Figure 4: Self-consistency F1-scores vs. temperature on Landing Aid task using Qwen3-8B model.
  • Figure 5: Performance improvement from ICL to SRCV method across different models, showing percentage gains and baseline comparisons.
  • ...and 1 more figure