Need a Small Specialized Language Model? Plan Early!

David Grangier; Angelos Katharopoulos; Pierre Ablin; Awni Hannun

TL;DR

This paper explores how to get good specialized small language models using a large, generic pretraining set and a limited amount of specialized data. It proposes a novel architecture, projected networks (PN): a large network whose parameters can be linearly projected into a small network for specialization.

Abstract

Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get good specialized small language models using a large, generic, pretraining set and a limited amount of specialized data. We consider two scenarios, depending on whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a single pretrained model for each task. In the first scenario, we propose an effective solution based on importance sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.
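The resampling idea in the first scenario can be illustrated with a minimal sketch. It assumes that every document already carries a cluster id (the outline lists a clustering step for the pretraining data) and that the generic set is reweighted so its cluster histogram matches that of the specialization data. The function names, the `eps` smoothing constant, and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def cluster_importance_weights(generic_clusters, spec_clusters, num_clusters, eps=1e-8):
    """Per-cluster weight w[c] = p_spec(c) / p_generic(c) (illustrative assumption)."""
    p_generic = np.bincount(generic_clusters, minlength=num_clusters) / len(generic_clusters)
    p_spec = np.bincount(spec_clusters, minlength=num_clusters) / len(spec_clusters)
    # eps guards against division by zero for clusters absent from the generic set.
    return p_spec / np.maximum(p_generic, eps)

def resample_generic(generic_clusters, spec_clusters, num_clusters, num_samples, seed=0):
    """Draw indices into the generic set so that cluster frequencies imitate the specialization set."""
    rng = np.random.default_rng(seed)
    weights = cluster_importance_weights(generic_clusters, spec_clusters, num_clusters)
    # Each generic document inherits the weight of its cluster.
    probs = weights[generic_clusters]
    probs = probs / probs.sum()
    return rng.choice(len(generic_clusters), size=num_samples, replace=True, p=probs)

# Toy data: the generic set is mostly cluster 0, the specialization set mostly cluster 1.
generic_clusters = np.array([0] * 80 + [1] * 20)
spec_clusters = np.array([1] * 9 + [0] * 1)
sampled = resample_generic(generic_clusters, spec_clusters, num_clusters=2, num_samples=10_000)
frac_cluster_1 = float(np.mean(generic_clusters[sampled] == 1))
```

In this toy setup the resampled stream draws most of its documents from cluster 1, mirroring the specialization histogram, even though cluster 1 is the minority in the generic set. A small model would then be pretrained on the resampled stream.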

Paper Structure (26 sections, 5 equations, 14 figures, 10 tables)

Introduction
Methods
Generic and Specialization Datasets
Baselines: Small Models, Fine-Tuning & Distillation
Clustering of the Pretraining Data
Cluster-Based Importance Sampling
Asymmetric Models: Projected Networks and Hard Mixtures
Experimental Setup
Methodology
Datasets
Models Hyper-parameters
Empirical Evaluation
Baselines: Fine-tuning, Distillation
Importance sampling
Asymmetric models: Hard Mixture of Experts and Projected Networks
...and 11 more sections

Figures (14)

Figure 1: Practical recommendations for training LMs that fit a predefined computational budget.
Figure 2: Projected networks (right) unlike distillation (left) instantiate small models in closed-form.
Figure 3: Training cost upper limits for pretraining and specialization (GPUh). Specialization is inexpensive except for SLM-is and SLM-d.
Figure 4: Specialized perplexity on the Pile subsets (average) before and after fine-tuning with different amounts of specialization data. Fine-tuning is necessary to reach good specialized perplexity for all models. With 1M specialization tokens, SLM-is competes with the LLM.
Figure 5: Distillation results (dashed lines) on the 1M token specialization set for various teacher pretraining budgets. On the left, we show perplexity with respect to the student pretraining cost only, and on the right, with respect to the overall pretraining cost. The cost of distillation is high relative to its benefit when compared with SLM-mix and SLM-pn.
...and 9 more figures
