Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

David Grangier; Simin Fan; Skyler Seto; Pierre Ablin

TL;DR

The paper addresses building task-specific language models when specialist data are scarce by reweighting generalist pretraining data. It introduces CRISP, a cluster-based importance sampling approach that aligns the generalist data distribution with the specialist distribution via cluster-level weights, enabling efficient pretraining, continued pretraining, and multitask setups. Empirical results show robust perplexity and accuracy gains across language modeling and multiple-choice QA tasks, with analysis guiding choices of clustering representation, cluster counts, and model size. The findings highlight CRISP’s scalability and practical value for producing domain-focused LMs without requiring large domain-specific corpora, and point to extensions to other modalities.

Abstract

Specialist language models (LMs) focus on a specific task or domain, on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is available only in limited amounts for most tasks. In this work, we build specialist models from large generalist training sets instead. We propose a novel method, ClusteRed Importance SamPling (CRISP). CRISP clusters the generalist dataset and samples from these clusters based on their frequencies in the smaller specialist dataset. It is scalable, suitable for both pretraining and continued pretraining, and works well in multi-task settings. CRISP performs favorably compared to other methods that adjust the training distribution of the generalist data with guidance from the limited domain-specific data. Our findings demonstrate improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering configurations, and model sizes.
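
The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes every example has already been assigned an integer cluster id (for instance by nearest centroid in an embedding space), and the function names `cluster_histogram` and `crisp_sample` are hypothetical. It draws a cluster according to its frequency in the specialist set, then draws a generalist example uniformly from within that cluster.

```python
import numpy as np


def cluster_histogram(cluster_ids, num_clusters):
    """Empirical frequency of each cluster id (sums to 1)."""
    counts = np.bincount(cluster_ids, minlength=num_clusters).astype(float)
    return counts / counts.sum()


def crisp_sample(generalist_ids, specialist_ids, num_clusters, num_samples, rng):
    """Return indices into the generalist set, sampled so that the cluster
    distribution of the sample follows the specialist cluster histogram.

    Illustrative sketch only; names and structure are assumptions.
    """
    target = cluster_histogram(specialist_ids, num_clusters)

    # Clusters with no generalist examples cannot be sampled from:
    # drop their mass and renormalize (an assumption of this sketch).
    available = np.bincount(generalist_ids, minlength=num_clusters) > 0
    target = np.where(available, target, 0.0)
    if target.sum() == 0.0:
        raise ValueError("no generalist example falls in a specialist cluster")
    target = target / target.sum()

    # Step 1: draw a cluster for each sample from the specialist histogram.
    chosen = rng.choice(num_clusters, size=num_samples, p=target)

    # Step 2: draw one generalist example uniformly inside each chosen cluster.
    members = [np.flatnonzero(generalist_ids == c) for c in range(num_clusters)]
    return np.array([rng.choice(members[c]) for c in chosen])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    generalist_ids = np.array([0, 0, 0, 1, 1, 2])  # toy cluster assignments
    specialist_ids = np.array([1, 1, 1, 2])        # specialist data favors cluster 1
    idx = crisp_sample(generalist_ids, specialist_ids, 3, 10, rng)
    print(generalist_ids[idx])  # cluster ids of the sampled generalist examples
```

In this toy setup, cluster 0 dominates the generalist set but never appears in the specialist set, so it receives zero sampling weight. Sampling with replacement within a cluster also means that small clusters can yield repeated examples.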

Paper Structure (27 sections, 8 equations, 10 figures, 15 tables, 1 algorithm)

This paper contains 27 sections, 8 equations, 10 figures, 15 tables, 1 algorithm.

Introduction
Related Work
Data Selection for Task-Adaptive Pretraining
Notations
Classification
Gradient-Alignment
CRISP: ClusteRed Importance Sampling for Pretraining
Experiments & Results
Language Modeling Tasks
Multiple Choice Question Tasks
Analysis
Clustering
Model Size
Different Amount of Training Data
Task-Transfer and Multitasking
...and 12 more sections

Figures (10)

Figure 1: Task-adaptive data selection with Clustered Importance Sampling (CRISP).
Figure 2: Pretraining perplexities for language modeling tasks.
Figure 5: Accuracy for multiple choice question tasks varying the text representation for clustering and the number of clusters. SBERT is more effective than LSI in all cases.
Figure 6: Number of occurrences of each training example for CRISP on MMLU. Repeated examples increase with the number of clusters.
Figure 7: Loss improvement on Redpj2 (valid) with respect to the base model, as a function of the SBERT distance to MMLU train. Models with a large number of clusters are better than the base model in a small area near MMLU train. The gray area indicates the $25$-$75\%$ quantiles for the MMLU test set.
...and 5 more figures

Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

TL;DR

Abstract

Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

Authors

TL;DR

Abstract

Table of Contents

Figures (10)