Using generative AI to support standardization work -- the case of 3GPP
Miroslaw Staron, Jonathan Strom, Albin Karlsson, Wilhelm Meding
TL;DR
The paper investigates how large language models can assist 3GPP standardization by automatically summarizing contributor documents and revealing agreements and discussion points. It implements a design science artifact that combines BART XLM for heading summarization, All-MiniLM embeddings for section-level similarity, and cosine-based, heading-weighted similarity to produce usable visualizations and agenda proposals, then validates the approach with Ericsson and a 3GPP delegate. Findings show strong correlations with human judgments in some analyses (up to 0.98) but weaker alignment at the full-document level (≈0.49) and highlight the need for domain-specific pretraining to capture technical nuances and proposals. The study demonstrates potential to reduce effort and accelerate consensus-building in standardization, while outlining concrete future work such as domain-specific 3GPP training and multimedia information extraction to improve accuracy and trust.
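To make the similarity step concrete, the following is a minimal sketch of a cosine-based, heading-weighted comparison between two documents. It is not the authors' implementation. The summary does not specify how heading and body similarities are combined, so the linear blend, the `heading_weight` parameter, the matching of sections by identical heading text, and the function names are all assumptions made for illustration. The vectors stand in for precomputed embeddings (for example from an All-MiniLM model); here they are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_similarity(doc_a, doc_b, heading_weight=0.3):
    """Hypothetical heading-weighted document similarity.

    Each document maps a heading string to a pair
    (heading_embedding, body_embedding). Sections are matched by
    identical heading text (an assumption), and each matched pair
    contributes a blend of heading and body cosine similarity.
    """
    shared = sorted(set(doc_a) & set(doc_b))
    if not shared:
        return 0.0
    total = 0.0
    for heading in shared:
        head_a, body_a = doc_a[heading]
        head_b, body_b = doc_b[heading]
        total += (heading_weight * cosine(head_a, head_b)
                  + (1.0 - heading_weight) * cosine(body_a, body_b))
    return total / len(shared)


# Toy embeddings standing in for real model output.
contribution_1 = {"Proposal": ([1.0, 0.0], [1.0, 1.0])}
contribution_2 = {"Proposal": ([1.0, 0.0], [1.0, 0.0])}
score = weighted_similarity(contribution_1, contribution_2)
```

Under this sketch, a higher `heading_weight` makes documents with matching section structure look more similar even when their body text differs, which is one plausible reading of "heading-weighted".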
Abstract
Standardization processes build upon consensus between partners, which depends on their ability to identify points of disagreement and resolve them. Large standardization organizations, like the 3GPP or ISO, rely on leaders of work packages who can correctly and efficiently identify disagreements, discuss them and reach a consensus. This task, however, is labor-intensive and costly. In this paper, we address the problem of identifying similarities, dissimilarities and discussion points using large language models. In a design science research study, we work with one of the organizations that leads several workgroups in the 3GPP standard. Our goal is to understand how well language models can help the standardization process become more cost-efficient, faster and more reliable. Our results show that generic models for text summarization correlate well with the domain expert's and the delegate's assessments (Pearson correlation between 0.66 and 0.98), but that domain-specific models are needed to provide better discussion materials for the standardization groups.
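The agreement figures above are Pearson correlations between model output and human assessments. As a reminder of what that statistic measures, here is a minimal sketch of the computation. The function name and the sample values are hypothetical and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical values for illustration only: model similarity scores
# and an expert's ratings for the same five document pairs.
model_scores = [0.91, 0.40, 0.75, 0.22, 0.63]
expert_ratings = [5, 2, 4, 1, 3]
r = pearson(model_scores, expert_ratings)
```

A value of `r` near 1 means the model ranks and spaces the document pairs much as the expert does, while a value near 0 means no linear relationship.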
