MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications
Gagan Raj Gupta, Anshul Kumar, Manish Rai, Apu Chakraborty, Ashutosh Modi, Abdelaali Chaoub, Soumajit Pramanik, Moyank Giri, Yashwanth Holla, Sunny Kumar, M. V. Kiran Sooraj
TL;DR
MM-Telco introduces a telecom-focused multimodal benchmark and domain-adapted models to bridge the gap between general-purpose LLMs and practical telecom applications. By structuring 3GPP Release 17 content into a knowledge graph and designing 10 multimodal tasks (text and image), including MCQs, long-form QA, information retrieval, and image understanding, the framework enables systematic evaluation of LLMs and VLMs in network operations, documentation, and troubleshooting. The authors also present a fine-tuned Llama-based image generator (Llama-VL-Telco) and employ PEFT via LoRA to balance performance and compute. In the reported results, specialized fine-tuning improves accuracy on telecom-specific tasks, while retrieval-augmented approaches and real-time AI agents show practical gains for incident resolution and documentation workflows. The work highlights both progress and remaining challenges in robust multimodal reasoning for telecom, with implications for industry adoption and ongoing standardization efforts.
Abstract
Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces a range of text-based and image-based tasks that address practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help identify the weak areas of current state-of-the-art multimodal LLMs, thus guiding further development and research.
