$FastDoc$: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy

Abhilash Nandy; Manav Nitin Kapadnis; Sohan Patnaik; Yash Parag Butala; Pawan Goyal; Niloy Ganguly

TL;DR

A novel, compute-efficient framework that uses document metadata and a domain-specific taxonomy as supervision signals to continually pre-train a transformer encoder on a domain-specific corpus, with a negligible drop in performance in the open domain.

Abstract

In this paper, we propose $FastDoc$ (Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy), a novel, compute-efficient framework that uses document metadata and a domain-specific taxonomy as supervision signals to continually pre-train a transformer encoder on a domain-specific corpus. The main innovation is that, during domain-specific pre-training, an open-domain encoder is continually pre-trained using sentence-level embeddings as inputs (to accommodate long documents), whereas fine-tuning is done with token-level embeddings as inputs to this encoder. We perform such domain-specific pre-training in three domains (customer support, scientific, and legal) and compare performance on 6 downstream tasks and 9 datasets. The novel use of document-level supervision along with sentence-level embedding input for pre-training reduces pre-training compute by around $1,000$, $4,500$, and $500$ times compared to MLM and/or NSP in the Customer Support, Scientific, and Legal Domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact, we show that $FastDoc$ either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the Customer Support, Scientific, and Legal Domains. Moreover, reduced training aids in mitigating the risk of catastrophic forgetting. Thus, unlike baselines, $FastDoc$ shows a negligible drop in performance in the open domain.
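To illustrate why sentence-level inputs accommodate longer documents than token-level inputs, here is a minimal sketch that measures what fraction of a toy corpus fits entirely into a 512-position encoder under each choice of unit. The whitespace tokenizer, the naive sentence splitter, and the helper names (`count_tokens`, `count_sentences`, `fraction_fully_encoded`) are assumptions made for illustration only; they are not taken from the paper, which would use a subword tokenizer and its own preprocessing.

```python
import re

MAX_POSITIONS = 512  # encoder input length used in this illustration


def count_tokens(document: str) -> int:
    # Assumption: whitespace-separated words stand in for subword tokens.
    return len(document.split())


def count_sentences(document: str) -> int:
    # Assumption: naive split on '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return len([p for p in parts if p])


def fraction_fully_encoded(documents, unit_counter, max_positions=MAX_POSITIONS):
    """Fraction of documents whose length, in the chosen unit, fits the input."""
    if not documents:
        return 0.0
    fits = sum(1 for doc in documents if unit_counter(doc) <= max_positions)
    return fits / len(documents)


# Toy corpus: one short document and one document of 200 four-word sentences.
short_doc = "Press the power button. Wait for the light."
long_doc = " ".join(["Restart the device now."] * 200)
corpus = [short_doc, long_doc]

token_coverage = fraction_fully_encoded(corpus, count_tokens)
sentence_coverage = fraction_fully_encoded(corpus, count_sentences)
print(f"token-level coverage:    {token_coverage:.2f}")
print(f"sentence-level coverage: {sentence_coverage:.2f}")
```

In this toy corpus the long document has 800 whitespace tokens but only 200 sentences, so it overflows a 512-token input while fitting comfortably into a 512-sentence input.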

Paper Structure (41 sections, 3 equations, 8 figures, 30 tables)

Introduction
FastDoc Framework
Pre-training Setup
Pre-training in the Customer Support Domain
Pre-training in the Scientific Domain
Pre-training in the Legal Domain
Downstream Datasets/Tasks
Customer Support
Scientific Domain
Legal Domain
Experiments and Results
Customer Support Domain
Scientific Domain
Legal Domain
Utility of the pre-training losses: Examples
...and 26 more sections

Figures (8)

Figure 1: End-to-end training pipeline using FastDoc
Figure 2: Relative change (in $\log_{10}$ scale) in the L1-norm of different types of parameters during pre-training via MLM vs. FastDoc.
Figure 3: Percentage of documents encoded entirely by the RoBERTa-BASE encoder when the input is $512$ tokens vs. $512$ sentences. (The red bar for "Scientific Domain" has negligible height and is hence not visible.)
Figure 4: Depiction of FastDoc. Anchor, Positive, and Negative Documents are encoded using a Sentence Transformer, followed by a transformer encoder, to give document representations. A combination of Triplet and Hierarchical Classification Losses is used to get the Total Loss.
Figure 5: FastDoc$(Q)$ pre-training architecture. It is similar to that of FastDoc, with two differences: (1) a Quadruplet Loss Function is used instead of the Triplet Loss Function used in FastDoc; (2) Anchor, Near Positive, Far Positive, and Negative Documents are taken as inputs.
...and 3 more figures
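
The Figure 4 caption describes a Total Loss that combines a Triplet Loss over document representations with a Hierarchical Classification Loss. A minimal sketch of one way such an objective could be computed is given below. The Euclidean distance, the margin value, the per-level softmax cross-entropy, and the unweighted sum are all assumptions made for illustration; the listing above does not specify the exact formulation, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    # Distance between two document representations (assumed metric).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge on the distance gap; the margin of 1.0 is a hypothetical choice.
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def cross_entropy(logits, target):
    # Negative log-softmax of the target class, computed stably.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def hierarchical_classification_loss(logits_per_level, labels_per_level):
    # Assumption: one classifier per taxonomy level, with losses summed.
    return sum(
        cross_entropy(logits, label)
        for logits, label in zip(logits_per_level, labels_per_level)
    )


def total_loss(anchor, positive, negative, logits_per_level, labels_per_level):
    # Assumption: unweighted sum of the two components.
    return triplet_loss(anchor, positive, negative) + hierarchical_classification_loss(
        logits_per_level, labels_per_level
    )


# Toy document representations and taxonomy logits (two levels: 2 and 4 classes).
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
logits_per_level = [[0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
labels_per_level = [0, 2]

loss = total_loss(anchor, positive, negative, logits_per_level, labels_per_level)
print(f"total loss: {loss:.4f}")
```

Here the negative document is already far enough from the anchor that the triplet term is zero, so the total reduces to the classification term over uniform logits.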
