Table of Contents
Fetching ...

Towards Enhanced RAC Accessibility: Leveraging Datasets and LLMs

Edison Jair Bejarano Sepulveda, Nicolai Potes Hector, Santiago Pineda Montoya, Felipe Ivan Rodriguez, Jaime Enrique Orduy, Alec Rosales Cabezas, Danny Traslaviña Navarrete, Sergio Madrid Farfan

TL;DR

The Aeronautical Regulations of Colombia (RAC) are highly complex and not easily accessible to non-experts. The authors propose a data-driven approach: assemble a fully curated RAC dataset from the initial five documents, annotate it with domain experts, and fine-tune a GEMMA-based LLM using PEFT with LoRA to answer RAC-related queries. They release a 24,478 Q&A RAC dataset and a fine-tuned RAC-focused model, demonstrating a scalable workflow for dataset creation, annotation, and model adaptation in regulatory domains. This work aims to enhance RAC comprehension, reduce reliance on expert consultations, and enable broader, practical navigation of aviation regulations.

Abstract

This paper explores the potential of large language models (LLMs) to make the Aeronautical Regulations of Colombia (RAC) more accessible. Given the complexity and extensive technicality of the RAC, this study introduces a novel approach to simplifying these regulations for broader understanding. By developing the first-ever RAC database, which contains 24,478 expertly labeled question-and-answer pairs, and fine-tuning LLMs specifically for RAC applications, the paper outlines the methodology for dataset assembly, expert-led annotation, and model training. Utilizing the Gemma1.1 2b model along with advanced techniques like Unsloth for efficient VRAM usage and flash attention mechanisms, the research aims to expedite training processes. This initiative establishes a foundation to enhance the comprehensibility and accessibility of RAC, potentially benefiting novices and reducing dependence on expert consultations for navigating the aviation industry's regulatory landscape. You can visit the dataset (https://huggingface.co/somosnlp/gemma-1.1-2b-it_ColombiaRAC_FullyCurated_format_chatML_V1) and the model (https://huggingface.co/datasets/somosnlp/ColombiaRAC_FullyCurated) here.

Towards Enhanced RAC Accessibility: Leveraging Datasets and LLMs

TL;DR

The Aeronautical Regulations of Colombia (RAC) are highly complex and not easily accessible to non-experts. The authors propose a data-driven approach: assemble a fully curated RAC dataset from the initial five documents, annotate it with domain experts, and fine-tune a GEMMA-based LLM using PEFT with LoRA to answer RAC-related queries. They release a 24,478 Q&A RAC dataset and a fine-tuned RAC-focused model, demonstrating a scalable workflow for dataset creation, annotation, and model adaptation in regulatory domains. This work aims to enhance RAC comprehension, reduce reliance on expert consultations, and enable broader, practical navigation of aviation regulations.

Abstract

This paper explores the potential of large language models (LLMs) to make the Aeronautical Regulations of Colombia (RAC) more accessible. Given the complexity and extensive technicality of the RAC, this study introduces a novel approach to simplifying these regulations for broader understanding. By developing the first-ever RAC database, which contains 24,478 expertly labeled question-and-answer pairs, and fine-tuning LLMs specifically for RAC applications, the paper outlines the methodology for dataset assembly, expert-led annotation, and model training. Utilizing the Gemma1.1 2b model along with advanced techniques like Unsloth for efficient VRAM usage and flash attention mechanisms, the research aims to expedite training processes. This initiative establishes a foundation to enhance the comprehensibility and accessibility of RAC, potentially benefiting novices and reducing dependence on expert consultations for navigating the aviation industry's regulatory landscape. You can visit the dataset (https://huggingface.co/somosnlp/gemma-1.1-2b-it_ColombiaRAC_FullyCurated_format_chatML_V1) and the model (https://huggingface.co/datasets/somosnlp/ColombiaRAC_FullyCurated) here.
Paper Structure (11 sections, 2 figures, 4 tables)

This paper contains 11 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Process for data extraction from RAC PDFs using GPT API.
  • Figure 2: Flow diagram for system annotator.