LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment

Varsha Embar, Ritvik Shrivastava, Vinay Damodaran, Travis Mehlinger, Yu-Chung Hsiao, Karthik Raghunathan

TL;DR

This paper addresses the challenge of extracting actionable insights from contact-center transcripts by focusing on concise call drivers instead of broad topics. It presents a production-ready LLM-based system that generates call drivers and leverages them for topic modeling, trend detection, and FAQ generation, while employing cost-saving strategies such as Multi-LoRA and input compression. The authors evaluate proprietary, open-weight, and fine-tuned models, and demonstrate that a 4-bit LoRA-tuned Mistral-7B-Instruct-v0.2 closely matches human annotations and delivers substantial cost savings, with topic modeling using HDBSCAN producing coherent clusters. The work showcases a practical, privacy-conscious deployment on Kubernetes (EKS) with spot instances, achieving scalable, low-latency contact-center analytics and actionable insights for agents and administrators.
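The TL;DR lists input compression as one cost lever but does not say how transcripts are compressed. As a minimal sketch, assuming a simple heuristic that is not the authors' method, the Python below drops turns that contain only filler words, collapses repeated whitespace, and reports the fraction of whitespace-delimited tokens saved (a rough proxy for LLM tokens). The filler vocabulary, the `Turn` type, and the functions `is_filler`, `compress_transcript`, and `token_savings` are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical filler vocabulary; a real system would tune this on its own data.
FILLER = {"uh", "um", "okay", "ok", "yeah", "mhm", "uh-huh", "right", "sure", "hmm"}


@dataclass(frozen=True)
class Turn:
    speaker: str  # e.g. "agent" or "customer"
    text: str


def is_filler(text: str) -> bool:
    """True if every word in the turn is a filler word (punctuation ignored)."""
    words = re.findall(r"[a-z'-]+", text.lower())
    return bool(words) and all(w in FILLER for w in words)


def compress_transcript(turns: list[Turn]) -> str:
    """Drop empty or filler-only turns and normalise whitespace; keep speaker labels."""
    kept = []
    for turn in turns:
        text = " ".join(turn.text.split())  # collapse runs of whitespace
        if not text or is_filler(text):
            continue
        kept.append(f"{turn.speaker}: {text}")
    return "\n".join(kept)


def token_savings(original: str, compressed: str) -> float:
    """Fraction of whitespace-delimited tokens removed by compression."""
    before = len(original.split())
    after = len(compressed.split())
    return 0.0 if before == 0 else 1 - after / before


# Usage on a made-up four-turn transcript.
turns = [
    Turn("agent", "Thank you for calling, how can I help?"),
    Turn("customer", "Um,  okay."),
    Turn("customer", "I was charged twice for my   May bill."),
    Turn("agent", "Mhm."),
]
original = "\n".join(f"{t.speaker}: {t.text}" for t in turns)
compressed = compress_transcript(turns)
print(compressed)
print(round(token_savings(original, compressed), 3))  # → 0.217
```

A heuristic like this only removes content that carries no call reason; whether it preserves call-driver quality would need to be checked against the human-annotated evaluation, as the paper does for its own models.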

Abstract

Large Language Models have transformed the Contact Center industry, manifesting in enhanced self-service tools, streamlined administrative processes, and augmented agent productivity. This paper describes our system that automates call driver generation, which serves as the foundation for tasks such as topic modeling, incoming call classification, trend detection, and FAQ generation, delivering actionable insights to contact center agents and administrators. We present a cost-efficient LLM system design, with 1) a comprehensive evaluation of proprietary, open-weight, and fine-tuned models, 2) cost-efficient strategies, and 3) the corresponding cost analysis when deployed in production environments.
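The abstract mentions a cost analysis of the production deployment but gives no figures here. The Python below is a minimal sketch of the per-call arithmetic such an analysis typically involves, assuming two simplified cost models: a self-hosted GPU instance billed hourly at a measured throughput, and a hosted API billed per million input and output tokens. The classes `SelfHosted` and `HostedApi` are hypothetical, and every price, throughput, and token count is a made-up placeholder, not a number from the paper. The sketch also ignores idle time, storage, and engineering costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfHosted:
    """GPU instance billed by the hour (placeholder figures, not from the paper)."""
    hourly_price_usd: float
    calls_per_hour: float  # sustained throughput measured on this instance

    def cost_per_1k_calls(self) -> float:
        # Hourly price spread over the calls processed in that hour.
        return 1000 * self.hourly_price_usd / self.calls_per_hour


@dataclass(frozen=True)
class HostedApi:
    """API billed per million tokens (placeholder figures, not from the paper)."""
    usd_per_m_input: float
    usd_per_m_output: float

    def cost_per_1k_calls(self, input_tokens: int, output_tokens: int) -> float:
        # Per-call token charges, scaled to 1,000 calls.
        per_call = (input_tokens * self.usd_per_m_input
                    + output_tokens * self.usd_per_m_output) / 1_000_000
        return 1000 * per_call


# Illustrative comparison with made-up numbers.
on_demand = SelfHosted(hourly_price_usd=1.00, calls_per_hour=500)
spot = SelfHosted(hourly_price_usd=0.40, calls_per_hour=500)
api = HostedApi(usd_per_m_input=0.50, usd_per_m_output=1.50)

print(on_demand.cost_per_1k_calls())
print(spot.cost_per_1k_calls())
print(api.cost_per_1k_calls(input_tokens=2000, output_tokens=50))
```

In this simplified model, self-hosted cost falls with a lower hourly price (such as spot capacity) or higher throughput, while API cost scales directly with the number of tokens sent and generated per call.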

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: A sample call transcript with call driver generations from various models.
  • Figure 2: Call driver length distributions reveal notable differences among models. The zero-shot baselines (GPT and Mistral) tend to generate longer call drivers, while our fine-tuned model aligns closely with human annotations, despite being trained on a separate synthetic dataset. Further analysis indicates that longer call drivers often include multiple detailed call reasons and are more likely to receive a neutral entailment rating, which lowers end-to-end performance (see the results table).
  • Figure 3: Topic modeling pipeline. A single Mistral model is deployed for both Call Driver Generation (LoRA fine-tuned) and Topic Labeling (backbone) as part of our cost-efficient strategy.
  • Figure 4: LLM hosting architecture. Circled numbers (e.g., ➀) mark the steps for model inference; filled circled numbers (e.g., ➋) mark the steps for scaling and hosting the LLMs. KEDA monitors queue workloads and triggers Karpenter to provision new GPU instances. This design lets us scale up from and down to zero instances and prioritize spot over on-demand instances to reduce cost.
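The Figure 4 caption describes queue-driven scaling: KEDA watches the request queue, and Karpenter provisions GPU nodes, preferring spot capacity. The Python below is a simplified sketch of that control logic, not KEDA's or Karpenter's actual implementation. The replica target roughly mirrors the Kubernetes Horizontal Pod Autoscaler rule for an average-value metric, ceil(queue length / target per replica), drops to zero when the queue is at or below an activation threshold, and falls back to on-demand capacity only when spot capacity is unavailable. `ScalingPolicy`, `desired_replicas`, and `choose_capacity` are hypothetical names, and real deployments add cooldown periods and stabilization windows that this sketch omits.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    target_per_replica: int        # queued requests one GPU replica should absorb
    max_replicas: int              # upper bound on GPU replicas (caps spend)
    activation_threshold: int = 0  # at or below this queue length, scale to zero


def desired_replicas(queue_length: int, policy: ScalingPolicy) -> int:
    """Replica count for the current queue length, allowing scale to zero."""
    if queue_length <= policy.activation_threshold:
        return 0
    wanted = math.ceil(queue_length / policy.target_per_replica)
    # At least one replica once activated, never more than the cap.
    return min(policy.max_replicas, max(1, wanted))


def choose_capacity(spot_available: bool) -> str:
    """Prefer cheaper spot capacity; fall back to on-demand when spot is unavailable."""
    return "spot" if spot_available else "on-demand"


policy = ScalingPolicy(target_per_replica=10, max_replicas=4)
# Queue lengths 0, 7, 25, 200 map to 0, 1, 3, 4 replicas.
for queued in (0, 7, 25, 200):
    print(queued, desired_replicas(queued, policy), choose_capacity(spot_available=True))
print(choose_capacity(spot_available=False))
```

Clamping to `max_replicas` bounds GPU spend during traffic spikes, and the zero floor is what allows the cluster to release every GPU node when the queue is empty.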