TextGram: Towards a better domain-adaptive pretraining

Sharayu Hiwarkhedkar; Saloni Mittal; Vidula Magdum; Omkar Dhekane; Raviraj Joshi; Geetanjali Kale; Arnav Ladkat

TL;DR

This work tackles the environmental and computational costs of pretraining large transformer models by evaluating multiple data-selection strategies for domain-adaptive pretraining. It systematically compares N-grams, TF-IDF, perplexity, cross-entropy, TextRank, and Random Selection, and introduces TextGram, a TextRank-based method enhanced with in-domain n-gram cues and paraphrase mining to rank and select informative sentences. Empirically, data selection improves downstream text classification performance, with TextGram delivering the strongest results among the tested approaches while using a smaller, more domain-relevant pretraining corpus. The findings suggest that carefully selecting domain-relevant data can reduce training time and energy consumption without sacrificing accuracy, enabling more sustainable domain adaptation for NLP models.

Abstract

For green AI, it is crucial to measure and reduce the carbon footprint of training large language models. In NLP, pre-training Transformer models requires significant computational resources. Pre-training uses a large amount of text data to acquire prior knowledge for downstream tasks. It is therefore important to select the right data, namely domain-specific data, from this vast corpus to achieve optimal results on domain-specific tasks. While training on large unsupervised data is expensive, the cost can be reduced by performing a data selection step before pre-training. Selecting important data reduces the space overhead and the substantial time required to pre-train the model while maintaining accuracy. We investigate existing selection strategies and propose our own domain-adaptive data selection method, TextGram, which effectively selects essential data from large corpora. We compare and evaluate the results of fine-tuned models on a text classification task with and without data selection. We show that the proposed strategy works better than the other selection methods.

Paper Structure (19 sections, 10 equations, 2 figures, 3 tables, 1 algorithm)

Introduction
Motivation
Related Work
Experimentation Setup
Datasets
Model Architecture
Data Selection Techniques
N-grams Coverage
TF-IDF based selection
Perplexity Based Data Selection
Perplexity as the normalized inverse probability of the test set
Cross Entropy
TextRank
Random Selection
Proposed technique - TextGram
...and 4 more sections

Figures (2)

Figure 1: High-level system architecture diagram. The corpus (both in-domain and out-domain) is first fed into a pre-processing pipeline that prepares the data for selection. The selection strategy then selects data from the out-domain corpus based on the in-domain training set. The selected corpus is used for continued pre-training of the BERT model. After pre-training, we fine-tune the model to adapt it to the in-domain corpus.
Figure 2: Architecture diagram: TextGram based ranking
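
The selection step described in the Figure 1 caption (choosing out-domain sentences based on the in-domain training set) can be illustrated with a minimal sketch. This is not the TextGram method: it is a plain n-gram overlap heuristic, and the function names, the whitespace tokenization, the bigram setting, and the toy sentences are all assumptions made for illustration only.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_by_ngram_overlap(in_domain, out_domain, n=2, top_k=2):
    """Rank out-domain sentences by the fraction of their n-grams seen in-domain.

    Hypothetical helper: a simple overlap score, not the paper's TextGram ranking.
    """
    # Collect every n-gram that occurs in the in-domain sentences.
    in_domain_ngrams = set()
    for sentence in in_domain:
        in_domain_ngrams.update(ngrams(sentence.lower().split(), n))

    scored = []
    for sentence in out_domain:
        grams = ngrams(sentence.lower().split(), n)
        # Fraction of this sentence's n-grams that also appear in-domain.
        if grams:
            score = sum(g in in_domain_ngrams for g in grams) / len(grams)
        else:
            score = 0.0
        scored.append((score, sentence))

    # Stable sort by score, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]


# Toy data, invented for this example.
in_domain = [
    "the patient was given a low dose of aspirin",
    "a low dose reduces the risk of bleeding",
]
out_domain = [
    "the team won the match last night",
    "doctors recommend a low dose of aspirin daily",
    "stock prices fell sharply on monday",
    "the risk of bleeding increases with age",
]

selected = select_by_ngram_overlap(in_domain, out_domain, n=2, top_k=2)
print(selected)
```

On this toy input the two medically flavored sentences are kept and the sports and finance sentences are dropped. In the pipeline the caption describes, the selected subset would then be used for continued pre-training in place of the full out-domain corpus.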
