Improving Domain Adaptation through Extended-Text Reading Comprehension

Ting Jiang; Shaohan Huang; Shengyue Luo; Zihan Zhang; Haizhen Huang; Furu Wei; Weiwei Deng; Feng Sun; Qi Zhang; Deqing Wang; Fuzhen Zhuang

TL;DR

This work addresses domain adaptation for large language models by moving beyond regex-based AdaptLLM preprocessing to a three-pronged framework: (i) LLM-based generation of high-quality question-answer pairs from domain corpora, (ii) length-based clustering to extend context and enrich comprehension, and (iii) parameter-efficient fine-tuning using LoRA. The approach yields consistent improvements over AdaptLLM, with average gains of $6.8\%$ in biomedicine and $5.6\%$ in finance, and demonstrates enhanced RAG performance through extended context. It leverages a distilled 7B LLM for scalable QA data generation and shows that LoRA with rank $256$ can match full fine-tuning efficiency while maintaining strong domain-specific performance, aided by int8 quantization. Overall, the method provides a scalable, cost-effective pathway to improve domain-specific capabilities of LLMs on large, unsupervised corpora.
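As a rough illustration of why LoRA stays parameter-efficient even at rank $256$: LoRA replaces a dense update $\Delta W \in \mathbb{R}^{d \times k}$ with a low-rank product $BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, so each adapted matrix trains $r(d+k)$ parameters instead of $dk$. The layer size $d = k = 4096$ below is an assumption chosen for illustration; the summary above does not state the base model's dimensions.

```latex
\frac{r(d+k)}{dk}
  = \frac{256 \cdot (4096 + 4096)}{4096 \cdot 4096}
  = \frac{2{,}097{,}152}{16{,}777{,}216}
  = 0.125
```

Under that assumption, a rank-$256$ adapter trains one eighth of the parameters of the matrix it adapts.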

Abstract

To enhance the domain-specific capabilities of large language models, continued pre-training on a domain-specific corpus is a prevalent method. Recent work demonstrates that adapting models using reading comprehension data formatted by regex-based patterns can significantly improve performance on domain-specific tasks. However, regex-based patterns are incapable of parsing raw corpora using domain-specific knowledge. Furthermore, the question-answer pairs, which are extracted directly from the corpus in predefined formats, offer limited context. To address these limitations, we improve reading comprehension via an LLM and clustering. The LLM leverages domain knowledge within the corpus to refine the comprehension stage, while clustering supplies relevant knowledge by extending the context to enrich the reading stage. Additionally, our method incorporates parameter-efficient fine-tuning to improve the efficiency of domain adaptation. In comparison to AdaptLLM, our method achieves an improvement exceeding 5% in domain-specific tasks. Our code will be available at https://github.com/microsoft/LMOps.
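The idea of extending context by clustering can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure or the grouping procedure, so the bag-of-words vectors, the greedy seed-based grouping, the whitespace token count, and the names `bow`, `cosine`, and `cluster_by_length` are all hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Toy bag-of-words vector (assumption: a real pipeline would likely use learned embeddings)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_by_length(docs, max_tokens):
    """Greedily group similar documents while each group stays within a token budget.

    Returns a list of clusters, each a list of indices into `docs`.
    Token counts are approximated by whitespace splitting (assumption).
    """
    vecs = [bow(d) for d in docs]
    lengths = [len(d.split()) for d in docs]
    unused = list(range(len(docs)))
    clusters = []
    while unused:
        seed = unused.pop(0)
        group, total = [seed], lengths[seed]
        # Visit the remaining documents from most to least similar to the seed,
        # adding each one that still fits in the budget and skipping the rest.
        for j in sorted(unused, key=lambda j: -cosine(vecs[seed], vecs[j])):
            if total + lengths[j] <= max_tokens:
                group.append(j)
                total += lengths[j]
        unused = [j for j in unused if j not in group]
        clusters.append(group)
    return clusters
```

Each resulting cluster could then be concatenated into one extended training context, so that a passage is read alongside related passages rather than in isolation. A seed document that alone exceeds the budget still forms its own single-document cluster in this sketch.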

Paper Structure (12 sections, 2 figures, 3 tables, 1 algorithm)

Introduction
Methods
LLM-based data Preprocessing
Length-based Clustering
Parameter Efficient Domain Adaptation
Experiments
Experiment Settings
Main Results
Ablation Study
Effect of Clustering
Effect of Parameter Efficient Fine-Tuning
Conclusion

Figures (2)

Figure 1: The overall framework of our method. Best viewed in color.
Figure 2: Ablation study on clustering on biomedicine with DAPT, ReadCompre and our method.
