Reducing the Scope of Language Models
David Yunis, Siyu Huo, Chulaka Gunasekara, Danish Contractor
TL;DR
The paper tackles the challenge of scoping large language models to domain-specific tasks: generating responses only for relevant queries and rejecting irrelevant ones. It evaluates a range of techniques, namely system prompting, supervised fine-tuning (SFT), direct preference optimization, probing-based classifiers, and Circuit Breakers (CB), across three model families and multiple task categories, using Accept Score and rejection metrics on in- and out-of-distribution data. Key findings show that supervised fine-tuning performs best when rejection data are diverse, while Circuit Breakers perform best under low-diversity conditions or adversarial prompts; layering SFT with CB often yields robust performance. The work offers guidance for practitioners on selecting, combining, and tuning scoping methods for real deployments, including insights on data diversity, model scale, and interpretation of internal representations.
Abstract
Large language models (LLMs) are deployed in a wide variety of user-facing applications. Typically, these deployments have some specific purpose, like answering questions grounded in documentation or acting as coding assistants, but they require general language understanding. In such deployments, LLMs should respond only to queries that align with the intended purpose and reject all other requests, such as generating poetry or answering questions about physics, a task we refer to as 'scoping'. We conduct a comprehensive empirical evaluation of various methods, ranging from prompting and fine-tuning to preference learning and the recently proposed general alignment technique known as Circuit Breakers (CB). Across three families of language models and a broad variety of tasks, we show that it is possible to scope language models. We examine scoping for multiple topics and for fine-grained topics. We ablate the diversity of irrelevant queries, layer different techniques, conduct adversarial evaluations, and more. Among other results, we find that when diverse examples of irrelevant queries are available, simple supervised fine-tuning produces the best results, but when such diversity is low, Circuit Breakers perform quite well. One can often get the benefits of both methods by layering them in succession. We intend our study to serve as a practitioner's guide to scoping LLMs.
