Observations on LLMs for Telecom Domain: Capabilities and Limitations
Sumit Soman, Ranjani H G
TL;DR
The paper investigates how state-of-the-art generative LLMs perform in telecom-domain chat interfaces using Cradlepoint data, addressing domain terminology adaptation, context continuity, robustness to input perturbations, and prompting strategies. It benchmarks GPT-4, GPT-3.5, Bard, and LLaMA across domain Q&A, product Q&A, context retention, and perturbation resilience, revealing that GPT-4 and Bard yield the strongest domain responses while LLaMA lags. Context retention is strongest with GPT-4 and Bard, though URL grounding and product-specific accuracy remain concerns, especially for highly specific items. The study highlights the need for domain-specific fine-tuning, data freshness, and grounding mechanisms (including potential plugins) before deployment in enterprise telecom settings, and suggests future work on improving clarification, hallucination mitigation, and domain adaptation.
Abstract
The landscape for building conversational interfaces (chatbots) has witnessed a paradigm shift with recent developments in generative Artificial Intelligence (AI) based Large Language Models (LLMs), such as ChatGPT by OpenAI (GPT-3.5 and GPT-4), Google's Bard, and Meta's LLaMA (Large Language Model Meta AI), among others. In this paper, we analyze the capabilities and limitations of incorporating such models in conversational interfaces for the telecommunication domain, specifically for enterprise wireless products and services. Using Cradlepoint's publicly available data for our experiments, we present a comparative analysis of the responses from such models for multiple use cases, including domain adaptation for terminology and product taxonomy, context continuity, and robustness to input perturbations and errors. We believe this evaluation will provide useful insights to data scientists engaged in building customized conversational interfaces for domain-specific requirements.
