TSpec-LLM: An Open-source Dataset for LLM Understanding of 3GPP Specifications

Rasoul Nikbakht, Mohamed Benzaghta, Giovanni Geraci

TL;DR

3GPP standards are large and intricate, which makes them challenging for LLMs to understand holistically. The authors present TSpec-LLM, an open-source dataset encompassing all 3GPP documents from Release 8 to Release 19 (1999–2023). It preserves the original tables and formulas to support telecom-focused pre-training, fine-tuning, and retrieval-augmented workflows. They evaluate baseline LLMs on telecom-specific questions and show that a naive-RAG pipeline using TSpec-LLM improves accuracy from 44–51% to 71–75%, with substantial gains on hard questions (66%). The work provides a practical, publicly available resource and a scalable retrieval-augmented approach for improving domain-specific understanding of telecom specifications. It also leaves room for offline deployment and for further accuracy gains through advanced indexing and task-focused datasets.

Abstract

Understanding telecom standards involves sorting through numerous technical documents, such as those produced by the 3rd Generation Partnership Project (3GPP), which is time-consuming and labor-intensive. While large language models (LLMs) can assist with the extensive 3GPP knowledge base, an inclusive dataset is crucial for their effective pre-training and fine-tuning. In this paper, we introduce TSpec-LLM, an open-source comprehensive dataset covering all 3GPP documents from Release 8 to Release 19 (1999–2023). To evaluate its efficacy, we first select a representative sample of 3GPP documents, create corresponding technical questions, and assess the baseline performance of various LLMs. We then incorporate a retrieval-augmented generation (RAG) framework to enhance LLM capabilities by retrieving relevant context from the TSpec-LLM dataset. Our evaluation shows that using a naive-RAG framework on TSpec-LLM improves the accuracy of GPT-3.5, Gemini 1.0 Pro, and GPT-4 from 44%, 46%, and 51% to 71%, 75%, and 72%, respectively.
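
To put the reported numbers side by side, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each model. The accuracy figures are taken from the abstract; the variable names are illustrative only.

```python
# Accuracy figures (percent) as reported in the abstract.
baseline = {"GPT-3.5": 44, "Gemini 1.0 Pro": 46, "GPT-4": 51}
with_rag = {"GPT-3.5": 71, "Gemini 1.0 Pro": 75, "GPT-4": 72}

# Absolute gain in percentage points for each model.
gains = {model: with_rag[model] - baseline[model] for model in baseline}

# Relative gain: improvement as a fraction of the baseline accuracy.
relative = {model: gains[model] / baseline[model] for model in baseline}

for model in baseline:
    print(
        f"{model}: {baseline[model]}% -> {with_rag[model]}% "
        f"(+{gains[model]} points, {relative[model]:.0%} relative)"
    )
```

By this arithmetic, the models with the lower baselines (GPT-3.5 and Gemini 1.0 Pro) gain the most from retrieval, while GPT-4 starts highest but gains the fewest points.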

Paper Structure

This paper contains 8 sections, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Word counts and file sizes for the TSpec-LLM dataset across various 3GPP releases. Data cut-off is December 2023.
  • Figure 2: Illustration of the naive-RAG paradigm, where documents are divided into chunks and stored in a vector database. User queries are matched with relevant chunks, which are then used to generate a prompt for an LLM to provide a coherent response.
  • Figure 3: Accuracy comparison among GPT-3.5, GPT-4, and Gemini, with and without employing naive-RAG on the TSpec-LLM dataset.
  • Figure 4: Accuracy comparison among GPT-3.5, GPT-4, and Gemini, on specified difficulty categories, with and without employing naive-RAG on the TSpec-LLM dataset.
  • Figure 5: Confidence levels, showing the frequency of the probabilities of correctness assigned to the answers generated by Gemini with RAG + TSpec-LLM.
  • ...and 2 more figures
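
The naive-RAG flow described in the Figure 2 caption (chunk the documents, store the chunks in a vector database, match the query to relevant chunks, build a prompt) can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline. It assumes a toy bag-of-words vector in place of a learned embedding model and an in-memory list in place of a real vector database. All names (`chunk_text`, `VectorStore`, `build_prompt`) and the sample sentences are hypothetical placeholders, not text from the 3GPP specifications.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
        start += chunk_size - overlap
    return chunks


def embed(text):
    """Toy embedding (assumption): a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (chunk_text, vector) pairs

    def add(self, chunks):
        self.items.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query, k=2):
        """Return the k stored chunks most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(
            self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True
        )
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, contexts):
    """Combine the retrieved chunks and the user question into one prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}\nAnswer:"
    )


# Placeholder documents, written for illustration only.
docs = [
    "The subcarrier spacing in NR is configurable through the numerology parameter.",
    "Handover procedures move a connected device from one cell to another.",
    "The path loss model for urban macro scenarios depends on distance and carrier frequency.",
]

store = VectorStore()
for doc in docs:
    store.add(chunk_text(doc))

query = "What determines the subcarrier spacing?"
prompt = build_prompt(query, store.search(query, k=1))
print(prompt)  # this prompt would then be sent to an LLM
```

In a real deployment the `embed` function would call an embedding model and `VectorStore` would be a persistent index. The control flow (chunk, index, retrieve top-k, prompt) stays the same.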