Scaling Laws for Discriminative Classification in Large Language Models

Dean Wyatte; Fatemeh Tahmasbi; Ming Li; Thomas Markovich

TL;DR

The paper addresses the challenge of deploying large language models in customer support by reframing the task as discriminative template classification, which avoids generation-based hallucinations. It introduces a two-stage pipeline: domain-adaptive pre-training on in-domain data, followed by discriminative fine-tuning with an $N_{classes}=640$ output head that predicts the top-$K$ templates (with $K=5$). It also provides scaling-law insights that connect language modeling loss, compute (FLOPs), and labeled-token data to classifier performance. Key contributions include the first industrial implementation of discriminative fine-tuning for LLMs in a production setting, empirical scaling relations for closed-domain adaptation, and practical guidance on the latency-accuracy trade-off in online customer-support workflows. The online case study shows measurable reductions in template selection time and consistent performance improvements across retraining cycles, which supports the approach's value for real-world, human-in-the-loop deployments and its potential extension to other fixed-classification tasks.

Abstract

Modern large language models (LLMs) represent a paradigm shift in what can plausibly be expected of machine learning models. The fact that LLMs can effectively generate sensible answers to a diverse range of queries suggests that they would be useful in customer support applications. While powerful, LLMs have been observed to be prone to hallucination, which unfortunately makes their near-term use in customer support applications challenging. To address this issue, we present a system that allows us to use an LLM to augment our customer support advocates by re-framing the language modeling task as a discriminative classification task. In this framing, we seek to present the top-K best template responses for a customer support advocate to use when responding to a customer. We present the results of both offline and online experiments, in which we observed offline gains and statistically significant online lifts for our experimental system. Along the way, we present observed scaling curves for validation loss and top-K accuracy, resulting from model parameter ablation studies. We close by discussing the space of trade-offs with respect to model size, latency, and accuracy, and by suggesting future applications to explore.
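
To make the top-K framing concrete, the following is a minimal sketch of how the K best templates could be selected from a classifier's output scores and how top-K accuracy could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names `top_k_indices` and `top_k_accuracy` are hypothetical, the scores are plain Python lists of logits, and the values 640 and 5 are taken from the TL;DR.

```python
from typing import List, Sequence

N_CLASSES = 640  # size of the template output head, as stated in the TL;DR
K = 5            # number of templates surfaced to the advocate, per the TL;DR


def top_k_indices(logits: Sequence[float], k: int = K) -> List[int]:
    """Return the indices of the k largest logits, best first (hypothetical helper)."""
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of classes")
    # Sort class indices by score, highest first, and keep the first k.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


def top_k_accuracy(
    batch_logits: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = K,
) -> float:
    """Fraction of examples whose true template index is among the top-k predictions."""
    if len(batch_logits) != len(labels) or not labels:
        raise ValueError("need one label per example and at least one example")
    hits = sum(
        label in top_k_indices(logits, k)
        for logits, label in zip(batch_logits, labels)
    )
    return hits / len(labels)


if __name__ == "__main__":
    # Toy scores: three templates stand out from an otherwise flat score vector.
    scores = [0.0] * N_CLASSES
    scores[10], scores[3], scores[600] = 5.0, 2.0, 1.0
    print(top_k_indices(scores, k=3))
    print(top_k_accuracy([scores, scores], labels=[600, 7], k=3))
```

In a deployment like the one described, the ranked indices would map to template responses shown to the advocate, who makes the final choice; the mapping from index to template text is outside this sketch.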

Paper Structure (13 sections, 9 figures, 5 tables)

Introduction
Related Work
Preliminaries
Methods
Customer Support Dataset
Domain Adaptive Pre-training
Discriminative Fine-tuning
Model Updates
Baseline Models
Experiments
Offline Training and Evaluation
Online Case Study
Conclusion

Figures (9)

Figure 1: Example customer support case with most relevant template responses.
Figure 2: Training pipeline for domain adaptation and discriminative fine-tuning.
Figure 3: Discriminative fine-tuning empirical scaling properties across different model sizes.
Figure 4: Weekly absolute difference in selection time between holdout and treatment groups.
Figure 5: Selection time between holdout and treatment groups over model lifetime.
...and 4 more figures
