The interplay between domain specialization and model size

Roseval Malaquias Junior; Ramon Pires; Thales Sales Almeida; Kenzo Sakiyama; Roseli A. F. Romero; Rodrigo Nogueira

TL;DR

This work investigates the interplay between domain specialization and model size during continued pretraining under compute-constrained scenarios and finds patterns in this interplay that can be generalized across different model sizes and domains.

Abstract

Scaling laws for language models have often focused on finding the optimal model size and token count for training from scratch. However, achieving this optimal balance requires significant compute resources due to the extensive data demands when training models from randomly initialized weights. Continued pretraining offers a cost-effective alternative, leveraging the compute investment from pretrained models to incorporate new knowledge without requiring extensive new data. Recent findings suggest that data quality influences constants in scaling laws, thereby altering the optimal parameter-token allocation ratio. Building on this insight, we investigate the interplay between domain specialization and model size during continued pretraining under compute-constrained scenarios. Our goal is to identify an optimal training regime for this scenario and detect patterns in this interplay that can be generalized across different model sizes and domains. To compare general and specialized training, we filtered a web-based dataset to extract data from three domains: legal, medical, and accounting. We pretrained models with 1.5B, 3B, 7B, and 14B parameters on both the unfiltered and filtered datasets, then evaluated their performance on domain-specific exams. Results show that as model size increases, specialized models outperform general models while requiring less training compute. Additionally, their growing compute efficiency leads to reduced forgetting of previously learned knowledge.

Paper Structure (15 sections, 3 equations, 5 figures, 1 table)

Introduction
Related work
Continual pretraining on domain-specific data
Scaling laws
Methodology
Pretraining data
Evaluation
Experimental setup
Results and discussion
Specialized models achieve lower perplexity on target domain
Specialized models present increasing sample-efficiency
Specialized models present diminishing forgetting
Specialized models exhibit higher perplexity on general domain
Conclusion
Limitations

Figures (5)

Figure 1: Each point represents the checkpoint with the lowest perplexity on a held-out legal test suite for either legal, general, or base models. In (a), a power-law relationship is observed: as model size increases, specialized models consistently outperform general models. In (b), the compute-effectiveness gap between specialized and general models is quantified using the SGER metric, showing that as model size increases, specialized models reach their lowest perplexity in fewer training steps.
Figure 2: Each point represents the checkpoint with the lowest perplexity on a held-out, domain-specific test suite for medical, accounting, general, or base models.
Figure 3: Optimal model size for a given amount of available compute, comparing legal and general models on the legal test suite.
Figure 4: Perplexity on the original general knowledge test suite vs. model size.
Figure 5: Perplexity on the new general knowledge test suite vs. model size.
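
The power-law relationship between model size and perplexity mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the functional form ppl = a * N^(-b), which is a common choice and not necessarily the exact form used in the paper; the function name `fit_power_law` is hypothetical, and the perplexity values are synthetic placeholders generated from a known power law, not results from the paper. Only the model sizes (1.5B, 3B, 7B, 14B) come from the abstract.

```python
import math

def fit_power_law(sizes, perplexities):
    """Least-squares fit of ppl = a * size**(-b), done as a line fit in log-log space.

    Assumed functional form for illustration only; returns (a, b).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(p) for p in perplexities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(ppl) = log(a) - b * log(size), so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope

# Model sizes from the abstract, in billions of parameters.
sizes = [1.5, 3.0, 7.0, 14.0]

# SYNTHETIC placeholder values drawn from an exact power law (a=10, b=0.2).
# These are NOT measurements from the paper.
synthetic_ppl = [10.0 * s ** -0.2 for s in sizes]

a, b = fit_power_law(sizes, synthetic_ppl)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Fitting one such curve per model family (specialized and general) and comparing the fitted exponents is one way to read the widening gap the caption describes, under the assumed functional form.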
