Target-Aware Language Modeling via Granular Data Sampling

Ernie Chang, Pin-Jie Lin, Yang Li, Changsheng Zhao, Daeil Kim, Rastislav Rabatin, Zechun Liu, Yangyang Shi, Vikas Chandra

TL;DR

This work tackles domain-specific pretraining with a target-aware data sampling framework built on multi-granular n-gram features. Documents are encoded as fixed-size feature vectors and selected from large corpora by importance sampling so that the training data aligns with target tasks, while tokenizer adaptation further reduces domain bias. Across eight benchmarks and model sizes from 125M to 1.5B, models pretrained on roughly 1% of the data perform on par with models trained on the full RefinedWeb data; multi-granular sampling outperforms single-granular baselines and preserves performance on non-target tasks. The findings suggest that careful data selection, aided by token granularity and vocabulary optimization, can substantially improve targeted downstream performance for smaller language models, with potential for scaling to larger models and datasets.

Abstract

Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which makes it possible to select large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observe that the sampled data correlates highly with target downstream task performance while preserving effectiveness on other tasks. This leads to the proposed data sampling paradigm, in which language models can be pretrained more efficiently on selected documents. On eight benchmarks, we demonstrate that with $\sim$1% of the data, pretrained models perform on par with those trained on the full RefinedWeb data and outperform those trained on randomly selected samples, for model sizes ranging from 125M to 1.5B.
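
The selection procedure the abstract describes (hashed n-gram features followed by importance sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: whitespace tokens stand in for the multi-granular tokenizer, and the bucket count `NUM_BUCKETS`, the additive smoothing, the bigram cutoff, the Gumbel top-k resampling step, and all function names are assumptions made for illustration.

```python
import math
import random
import zlib
from collections import Counter

NUM_BUCKETS = 10_000  # hypothetical feature dimension, not taken from the paper


def ngrams(tokens, max_n=2):
    """Yield every n-gram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def featurize(tokens, num_buckets=NUM_BUCKETS):
    """Hash n-grams into a fixed-size count vector, stored sparsely as bucket -> count."""
    return Counter(zlib.crc32(g.encode("utf-8")) % num_buckets for g in ngrams(tokens))


def fit_log_probs(docs, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Estimate smoothed log-probabilities of each hash bucket over a set of documents."""
    totals = [smoothing] * num_buckets
    for tokens in docs:
        for bucket, count in featurize(tokens, num_buckets).items():
            totals[bucket] += count
    norm = sum(totals)
    return [math.log(t / norm) for t in totals]


def log_weight(tokens, target_logp, raw_logp):
    """Log importance weight: sum over buckets of count * (log p_target - log p_raw)."""
    feats = featurize(tokens, len(target_logp))
    return sum(c * (target_logp[b] - raw_logp[b]) for b, c in feats.items())


def select(raw_docs, target_docs, k, seed=0):
    """Return indices of k raw documents, resampled without replacement via Gumbel top-k."""
    rng = random.Random(seed)
    target_logp = fit_log_probs(target_docs)
    raw_logp = fit_log_probs(raw_docs)
    keys = []
    for idx, tokens in enumerate(raw_docs):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        keys.append((log_weight(tokens, target_logp, raw_logp) + gumbel, idx))
    return [idx for _, idx in sorted(keys, reverse=True)[:k]]


# Toy usage with made-up sentences (whitespace split is an assumption).
target = [s.split() for s in ["plants need sunlight to grow", "water boils when heated"]]
raw = [s.split() for s in ["plants need sunlight to grow",
                           "stock prices fell sharply today",
                           "the team won the final match"]]
print(select(raw, target, k=2))
```

Working in log space keeps the weights numerically stable for long documents, and adding Gumbel noise to the log weights before taking the top `k` is one common way to sample without replacement in proportion to the weights; the paper may use a different resampling scheme.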

Paper Structure

This paper contains 16 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Multi-granular tokenization for more modular feature vectors used in importance sampling. (1) A document $d_{i}$ is featurized as a sequence of multi-granular tokens. (2) The document is then transformed into a fixed-size feature representation by hashing n-grams. (3) Its significance is measured through the enhancement weight $w_i$, and a subset of $K$ representative data points is selected from the original target distributions through re-sampling.
  • Figure 2: Average KL reduction plotted against performance on HellaSwag. We measure how the granularity of the tokens used for coreset selection reduces KL divergence to the target distribution compared to random sampling from RefinedWeb; the results suggest a strong correlation between KL reduction and downstream performance (Pearson $r=0.82$).
  • Figure 3: Zero-shot performance averaged over eight tasks, computed across all model sizes; an emergent characteristic can be observed at the 350M-parameter model size.
  • Figure 4: Comparison of multi-granular n-grams with the N-gram and Random baselines across eight tasks using 125M models trained solely on data selected with ARC-Easy as the target. Relative performance is reported. Multi-granular features enable the model to consistently outperform the baselines despite task-specific biases in the data. Similar experiments for all other benchmarks appear in the appendix; for all eight target tasks the same pattern holds, with multi-granular n-grams yielding almost no degradation across benchmarks.
  • Figure 5: Comparison of multi-granular n-grams with the N-gram and Random baselines using 125M models trained solely on data selected with HellaSwag, OBQA, and WinoGrande data as the target, respectively.
  • ...and 1 more figures
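
The KL reduction and Pearson correlation mentioned in the Figure 2 caption can be sketched as follows. The exact definition used in the paper is not given in this excerpt, so the sketch assumes KL reduction is the KL divergence from the target to a random subset minus the KL divergence from the target to the selected subset, computed over hashed n-gram distributions. The function names and the toy three-bucket distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same buckets; q must be > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_reduction(target, random_subset, selected_subset):
    """Assumed definition: how much closer the selected data is to the target than random data."""
    return kl_divergence(target, random_subset) - kl_divergence(target, selected_subset)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy distributions over three hash buckets (illustrative values only).
target_dist = [0.5, 0.3, 0.2]
random_dist = [1 / 3, 1 / 3, 1 / 3]
selected_dist = [0.45, 0.35, 0.2]
print(kl_reduction(target_dist, random_dist, selected_dist))
```

A positive value means the selected subset sits closer to the target distribution than the random subset does. Given per-configuration KL reductions and matching benchmark scores, `pearson_r` would produce a correlation of the kind the caption reports.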