Scalable and Efficient Large-Scale Log Analysis with LLMs: An IT Software Support Case Study
Pranjal Gupta, Karan Bhukar, Harshit Kumar, Seema Nagar, Prateeti Mohapatra, Debanjana Kar
TL;DR
This work tackles the impracticality of manual log analysis in large IT environments by introducing a CPU-friendly, LLM-based log analytics tool. The system uses log templatisation and a representative set of logs to enable scalable, CPU-only inference, with Label Broadcasting propagating labels across log clusters. It produces several reports (Summary, Diagnosis, Temporal Trend, and Causal Graph) that aid issue diagnosis. BERTOps underpins three tasks: Golden Signals, Fault Categories, and Named Entities. It shows strong cross-domain generalization across four real-world datasets and a finance case study, achieving substantial speedups and data reductions while maintaining accuracy. The tool has been deployed across 70 IBM software products and has processed over 2,000 tickets, delivering meaningful time and cost savings. This illustrates a practical path to AI-assisted IT support at scale.
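The idea behind templatisation plus Label Broadcasting can be illustrated with a minimal sketch. The sketch makes several assumptions. Templates are built by masking variable tokens with a few hypothetical regex rules. The first line of each template cluster serves as its representative. A caller-supplied `classify` function stands in for the model (BERTOps itself is not reproduced here). None of these choices are claimed to be the authors' implementation. The point is that the model is called once per cluster rather than once per log line.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical masking rules (an assumption, not the paper's templatiser):
# IPs are masked before plain numbers so their digits are not split up.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def templatise(line: str) -> str:
    """Reduce a raw log line to a template by masking variable fields."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line


def label_with_broadcast(lines: List[str], classify: Callable[[str], str]) -> List[str]:
    """Classify one representative per template cluster, then copy its label to all members."""
    clusters: Dict[str, List[int]] = defaultdict(list)
    for i, line in enumerate(lines):
        clusters[templatise(line)].append(i)

    labels = [""] * len(lines)
    for members in clusters.values():
        representative = lines[members[0]]  # assumption: first occurrence represents the cluster
        label = classify(representative)    # one model call per cluster, not per line
        for i in members:
            labels[i] = label               # broadcast the label to every cluster member
    return labels


if __name__ == "__main__":
    logs = [
        "ERROR disk 3 full on 10.0.0.1",
        "ERROR disk 7 full on 10.0.0.2",
        "INFO request 42 served",
    ]
    # Toy stand-in for the model; a real deployment would call an LLM or classifier here.
    toy = lambda s: "error" if s.startswith("ERROR") else "info"
    print(label_with_broadcast(logs, toy))  # ['error', 'error', 'info']
```

In this toy run, three lines collapse into two templates, so the classifier is invoked twice. On real logs with heavy repetition, the number of model calls scales with the number of distinct templates rather than the raw log volume. This is what makes CPU-only inference plausible in this sketch.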
Abstract
IT environments typically have logging mechanisms to monitor system health and detect issues. However, the huge volume of generated logs makes manual inspection impractical, highlighting the importance of automated log analysis in IT Software Support. In this paper, we propose a log analytics tool that leverages Large Language Models (LLMs) for log data processing and issue diagnosis, enabling the generation of automated insights and summaries. We further present a novel approach for efficiently running LLMs on CPUs to process massive log volumes in minimal time without compromising output quality. We share the insights and lessons learned from deploying the tool, which has been in production since March 2024. It has been scaled across 70 software products and has processed over 2,000 tickets for issue diagnosis. Compared with traditional log analysis practices, it has saved more than 300 man-hours and an estimated $15,444 per month in manpower costs.
