Scaling Laws for Discriminative Classification in Large Language Models
Dean Wyatte, Fatemeh Tahmasbi, Ming Li, Thomas Markovich
TL;DR
The paper addresses the challenge of deploying large language models in customer support by reframing the task as discriminative template classification, which avoids the hallucinations that free-form generation can produce. It introduces a two-stage pipeline: domain-adaptive pre-training on in-domain data, followed by discriminative fine-tuning with an output head over $N_{classes}=640$ templates that predicts the top-$K$ templates (with $K=5$). It also provides scaling-law insights that relate language modeling loss, compute (FLOPs), and the number of labeled tokens to classifier performance. Key contributions include the first industrial implementation of discriminative fine-tuning for LLMs in a production setting, empirical scaling relations for closed-domain adaptation, and practical guidance on the trade-off between latency and accuracy in online customer-support workflows. The online case study reports measurable reductions in template selection time and consistent performance improvements across retraining cycles, which supports the approach's value for real-world, human-in-the-loop deployments and its potential extension to other tasks with a fixed set of classes.
Abstract
Modern large language models (LLMs) represent a paradigm shift in what can plausibly be expected of machine learning models. The fact that LLMs can effectively generate sensible answers to a diverse range of queries suggests that they would be useful in customer support applications. While powerful, LLMs have been observed to be prone to hallucination, which unfortunately makes their near-term use in customer support applications challenging. To address this issue, we present a system that allows us to use an LLM to augment our customer support advocates by reframing the language modeling task as a discriminative classification task. In this framing, we seek to present the top-K best template responses for a customer support advocate to use when responding to a customer. We present the results of both offline and online experiments, in which we observed offline gains and statistically significant online lifts for our experimental system. Along the way, we present observed scaling curves for validation loss and top-K accuracy, resulting from model parameter ablation studies. We close by discussing the space of trade-offs with respect to model size, latency, and accuracy, and by suggesting future applications to explore.
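To make the top-K framing concrete, the following is a minimal sketch of how top-K template selection and top-K accuracy could be computed from the logits of a classification head. It is not the authors' implementation: the function names `top_k_indices` and `top_k_accuracy` are hypothetical, the logits are synthetic random numbers rather than model outputs, and only the values 640 and 5 come from the summary above.

```python
import numpy as np

N_CLASSES = 640  # number of response templates, as stated in the TL;DR
K = 5            # number of templates shown to the advocate, as stated in the TL;DR


def top_k_indices(logits: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring classes per row, best first."""
    # argpartition finds the k largest entries per row without a full sort.
    part = np.argpartition(-logits, k - 1, axis=-1)[..., :k]
    # Sort only those k candidates by descending score.
    part_scores = np.take_along_axis(logits, part, axis=-1)
    order = np.argsort(-part_scores, axis=-1)
    return np.take_along_axis(part, order, axis=-1)


def top_k_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of rows whose true label appears among the top-k predictions."""
    top = top_k_indices(logits, k)
    hits = (top == labels[:, None]).any(axis=-1)
    return float(hits.mean())


# Synthetic stand-in for classifier outputs (assumption: not real model data).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, N_CLASSES))
labels = rng.integers(0, N_CLASSES, size=8)

# Make the first four examples clear hits and the last four clear misses,
# so the resulting accuracy is known by construction.
logits[np.arange(4), labels[:4]] = 100.0
logits[np.arange(4, 8), labels[4:]] = -100.0

suggestions = top_k_indices(logits, K)      # shape (8, 5): templates to display
accuracy = top_k_accuracy(logits, labels, K)
print(suggestions[0], accuracy)
```

In a human-in-the-loop setting like the one described, each row of `suggestions` would correspond to the short list of templates offered to an advocate, and top-K accuracy measures how often the template the advocate needed was on that list.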
