Can Large Language Models Automate the Refinement of Cellular Network Specifications?

Jianshuo Dong; Yuanjie Li; Jun Liu; Hewu Li; Han Qiu

TL;DR

This paper investigates automating the refinement of cellular network specifications with Large Language Models, using Change Requests (CRs) as domain data. It introduces CR-Eval, a 200-item benchmark covering three domain tasks (discover-weakness, outline-revision, reflect-revision), and evaluates 31 LLMs; reasoning-powered and domain-specialized models perform strongly on the challenging weakness-discovery task. The authors propose a three-stage fine-tuning framework (DACT, TST, SCT) plus rationale augmentation. The resulting 8B domain-specialized model, CRitic-LLaMA-3.1-8B, achieves substantial gains on the domain tasks, approaches or surpasses some larger proprietary baselines on key tasks, and demonstrates effectiveness against known cellular attacks. The work provides a systematic pathway for domain-specific LLM specialization in safety-conscious settings, highlights the remaining challenges (long context, calibration, and comprehensive coverage), and releases code and the benchmark to foster further research and practical evaluation.

Abstract

Cellular networks, e.g., 4G/5G, rely on complex technical specifications to ensure correct functionality; however, these specifications often contain flaws or ambiguities. In this paper, we investigate the application of Large Language Models for automated cellular network specification refinement. We identify Change Requests, which record specification revisions, as a key source of domain-specific data and formulate specification refinement as three complementary sub-tasks. We introduce CR-Eval, a benchmark of 200 security-related test cases, and evaluate 17 open-source and 14 proprietary models. The best-performing model, GPT-o3-mini, identifies weaknesses in over 127 test cases within five trials. We further study LLM specialization, showing that fine-tuning an 8B model can outperform advanced LLMs such as DeepSeek-R1 and Qwen3-235B. Evaluations on 30 real-world cellular attacks demonstrate the practical impact and remaining challenges. The codebase and benchmark are available at https://github.com/jianshuod/CR-Eval.
