TSpec-LLM: An Open-source Dataset for LLM Understanding of 3GPP Specifications
Rasoul Nikbakht, Mohamed Benzaghta, Giovanni Geraci
TL;DR
3GPP standards are large and intricate, which makes them difficult for LLMs to understand holistically. The authors present TSpec-LLM, an open-source dataset encompassing all 3GPP documents from Release 8 to Release 19 (1999–2023), with original tables and formulas preserved to support telecom-focused pre-training, fine-tuning, and retrieval-augmented workflows. They evaluate baseline LLMs (GPT-3.5, Gemini 1.0 Pro, and GPT-4) on telecom-specific questions and show that a naive-RAG pipeline built on TSpec-LLM improves accuracy from 44–51% to 71–75%, with substantial gains on hard questions (66%). The work provides a practical, publicly available resource and a scalable retrieval-augmented approach for improving domain-specific understanding of telecom specifications, with potential for offline deployment and for further accuracy gains through advanced indexing and task-focused datasets.
Abstract
Understanding telecom standards involves sorting through numerous technical documents, such as those produced by the 3rd Generation Partnership Project (3GPP), which is time-consuming and labor-intensive. While large language models (LLMs) can assist with the extensive 3GPP knowledge base, an inclusive dataset is crucial for their effective pre-training and fine-tuning. In this paper, we introduce TSpec-LLM, an open-source comprehensive dataset covering all 3GPP documents from Release 8 to Release 19 (1999–2023). To evaluate its efficacy, we first select a representative sample of 3GPP documents, create corresponding technical questions, and assess the baseline performance of various LLMs. We then incorporate a retrieval-augmented generation (RAG) framework to enhance LLM capabilities by retrieving relevant context from the TSpec-LLM dataset. Our evaluation shows that using a naive-RAG framework on TSpec-LLM improves the accuracy of GPT-3.5, Gemini 1.0 Pro, and GPT-4 from 44%, 46%, and 51% to 71%, 75%, and 72%, respectively.
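As a rough illustration of what a naive retrieval-augmented pipeline involves, the sketch below chunks a small corpus, ranks the chunks against a question, and assembles a prompt from the top matches. It is not the authors' pipeline: the TF-IDF scoring stands in for whatever embedding model a real system would use, the helper names (`chunk`, `build_index`, `retrieve`, `build_prompt`) and the chunk sizes are hypothetical, and the corpus consists of invented placeholder passages rather than text from any 3GPP document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text, size=40, overlap=10):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def build_index(chunks):
    """Compute smoothed IDF weights and one sparse TF-IDF vector per chunk."""
    counts = [Counter(tokenize(c)) for c in chunks]
    doc_freq = Counter(term for tf in counts for term in tf)
    n = len(chunks)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    vectors = [{t: f * idf[t] for t, f in tf.items()} for tf in counts]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, idf, vectors, k=2):
    """Return the k chunks most similar to the query."""
    q = {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(q, vectors[i]), reverse=True)
    return [chunks[i] for i in ranked[:k]]


def build_prompt(question, contexts):
    """Assemble the retrieved context and the question into a single prompt."""
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:")


# Invented placeholder passages (NOT quotes from any specification).
documents = [
    "Placeholder passage about path loss: signal attenuation grows with "
    "distance between transmitter and receiver.",
    "Placeholder passage about channel coding: redundancy bits are added so "
    "that the receiver can correct errors.",
    "Placeholder passage about scheduling: the base station assigns time and "
    "frequency resources to users.",
]

chunks = [piece for doc in documents for piece in chunk(doc)]
idf, vectors = build_index(chunks)

question = "How does signal attenuation grow with distance?"
prompt = build_prompt(question, retrieve(question, chunks, idf, vectors, k=1))
print(prompt)  # this string would then be sent to the LLM of choice
```

Sparse lexical scoring keeps the sketch dependency-free. A practical system would more likely embed chunks with a neural encoder and store them in a vector index, but the retrieve-then-prompt structure stays the same.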
