Leveraging Multi-Agent System (MAS) and Fine-Tuned Small Language Models (SLMs) for Automated Telecom Network Troubleshooting

Chenhua Shi, Bhavika Jalli, Gregor Macdonald, John Zou, Wanlu Lei, Mridul Jain, Joji Philip

TL;DR

The paper tackles scalable, automated telecom network troubleshooting by combining an LLM-guided Multi-Agent System (MAS) with a domain-grounded, fine-tuned Small Language Model (SLM) that serves as the solution planner. The architecture uses Hypha for agent orchestration, a knowledge graph and HippoRAG for grounding, and a ReAct-style loop with human-in-the-loop (HITL) oversight for reliability. The two key contributions are a two-stage fine-tuning pipeline for the solution planner, supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT), and an end-to-end pipeline that delivers operational gains across RAN and Core domains, including up to a six-fold reduction in mean time to troubleshoot and a ~10% accuracy improvement. The approach also addresses practical deployment concerns: it reduces reliance on external LLMs, improves privacy, and enables efficient, context-grounded remediation through TRL-based reinforcement fine-tuning with LoRA and GRPO.
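To make the GRPO step mentioned above more concrete, the following is a minimal sketch of the group-relative advantage at its core: several candidate plans are sampled for the same prompt, each is scored, and every score is normalized against its own group. The reward values and the function name are hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail); this is not the authors' training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    eps keeps the division defined when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate plans sampled for one fault report.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Plans that score above the group mean receive a positive advantage and are reinforced; plans below it are discouraged. No separate value model is needed, which is one reason this style of update is attractive when fine-tuning a small model.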

Abstract

Telecom networks are rapidly growing in scale and complexity, making effective management, operation, and optimization increasingly challenging. Although Artificial Intelligence (AI) has been applied to many telecom tasks, existing models are often narrow in scope, require large amounts of labeled data, and struggle to generalize across heterogeneous deployments. Consequently, network troubleshooting continues to rely heavily on Subject Matter Experts (SMEs) to manually correlate various data sources to identify root causes and corrective actions. To address these limitations, we propose a Multi-Agent System (MAS) that employs an agentic workflow, with Large Language Models (LLMs) coordinating multiple specialized tools for fully automated network troubleshooting. Once faults are detected by AI/ML-based monitors, the framework dynamically activates agents such as an orchestrator, solution planner, executor, data retriever, and root-cause analyzer to diagnose issues and recommend remediation strategies within a short time frame. A key component of this system is the solution planner, which generates appropriate remediation plans based on internal documentation. To enable this, we fine-tuned a Small Language Model (SLM) on proprietary troubleshooting documents to produce domain-grounded solution plans. Experimental results demonstrate that the proposed framework significantly accelerates troubleshooting automation across both Radio Access Network (RAN) and Core network domains.
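As a rough illustration of the agentic workflow described in the abstract, the sketch below has an orchestrator pass one incident through the named agent roles in turn. Everything here is hypothetical: the `Incident` type, the function bodies, the placeholder strings, and in particular the order in which agents run are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    """A detected fault plus whatever each agent has added so far."""
    alarm: str
    findings: Dict[str, str] = field(default_factory=dict)


Agent = Callable[[Incident], None]


def solution_planner(incident: Incident) -> None:
    # In the paper this role is filled by a fine-tuned SLM; here it is a stub.
    incident.findings["plan"] = f"placeholder plan for: {incident.alarm}"


def data_retriever(incident: Incident) -> None:
    incident.findings["data"] = "placeholder metrics and logs"


def root_cause_analyzer(incident: Incident) -> None:
    incident.findings["root_cause"] = (
        "placeholder hypothesis based on: " + incident.findings["data"]
    )


def executor(incident: Incident) -> None:
    incident.findings["recommendation"] = (
        "placeholder remediation for: " + incident.findings["root_cause"]
    )


def orchestrator(incident: Incident, agents: List[Agent]) -> Incident:
    """Run each agent in order; the ordering is an assumption of this sketch."""
    for agent in agents:
        agent(incident)
    return incident


result = orchestrator(
    Incident(alarm="cell outage"),
    [solution_planner, data_retriever, root_cause_analyzer, executor],
)
```

A fixed list of agents keeps the sketch short. The abstract describes agents being activated dynamically, so a real orchestrator would decide at run time which agent to call next.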

Paper Structure

This paper contains 15 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: The architecture diagram of the Multi-Agent System (MAS) for Automated Network Troubleshooting.
  • Figure 2: RFT pipeline for fine-tuning a Small Language Model (SLM) as a solution planner using Transformer Reinforcement Learning (TRL) across multiple GPUs.
  • Figure 3: Autonomous Network Operations Agent Mean Time to Troubleshoot.
  • Figure 4: Autonomous Network Operations Agent Troubleshooting Accuracy.
  • Figure 5: Mean of Rewards on Training and Evaluation among Different Models.
  • ...and 2 more figures