Word Matters: What Influences Domain Adaptation in Summarization?

Yinghao Li; Siyu Miao; Heyan Huang; Yang Gao


TL;DR

This work examines how word-level properties of training data influence domain adaptation for abstractive summarization. It introduces two learning-difficulty indicators, Compression Ratio $\alpha$ and Abstraction Level $\beta$, and combines them into a learning difficulty coefficient $\lambda = \alpha \beta$ to explain transfer performance. The authors show that cross-domain overlap $\gamma$ is approximately linearly related to performance gain once $\lambda$ is accounted for (LD-Gain), while total word count has no consistent effect. They further present a predictive framework in which performance on an unseen domain can be estimated from $\lambda$ and $\gamma$ without retraining, a resource-efficient way to anticipate domain transfer across multiple models and datasets. The findings suggest prioritizing word-level domain similarity over data volume, and they enable assessing a domain's suitability for an LLM summarization system before any training is done.

Abstract

Domain adaptation aims to enable Large Language Models (LLMs) to generalize effectively to domain datasets unseen during the training phase. However, factors such as the number of model parameters and the scale of training data are general influences and do not reflect the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation performance, analyzing the specific impact of 'words' in training data on summarization tasks. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization, which is determined by two indicators: word-based compression rate and abstraction level. Our experiments show that, when dataset learning difficulty is taken into account, cross-domain overlap and performance gain in summarization tasks exhibit an approximately linear relationship, which is not directly related to the number of words. Based on this finding, it is possible to predict a model's performance on unknown domain datasets without any training.
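
As a rough illustration of the word-level quantities named above, the sketch below computes a compression ratio, an abstraction level, their product $\lambda = \alpha \beta$, and a vocabulary-based cross-domain overlap. The concrete formulas are assumptions made for illustration only (source length over summary length, fraction of summary words absent from the source, and share of target vocabulary seen in the source domain); this page does not give the paper's exact definitions, and all function names are hypothetical.

```python
def tokenize(text):
    """Whitespace tokenizer on lowercased text (an assumption, not the paper's)."""
    return text.lower().split()


def compression_ratio(document, summary):
    """Assumed alpha: number of document words divided by number of summary words."""
    summary_words = tokenize(summary)
    if not summary_words:
        raise ValueError("summary must contain at least one word")
    return len(tokenize(document)) / len(summary_words)


def abstraction_level(document, summary):
    """Assumed beta: fraction of summary words that never appear in the document."""
    summary_words = tokenize(summary)
    if not summary_words:
        raise ValueError("summary must contain at least one word")
    document_vocab = set(tokenize(document))
    novel = [w for w in summary_words if w not in document_vocab]
    return len(novel) / len(summary_words)


def learning_difficulty(document, summary):
    """lambda = alpha * beta, the combination stated in the TL;DR."""
    return compression_ratio(document, summary) * abstraction_level(document, summary)


def cross_domain_overlap(source_texts, target_texts):
    """Assumed gamma: share of the target-domain vocabulary seen in the source domain."""
    source_vocab = {w for t in source_texts for w in tokenize(t)}
    target_vocab = {w for t in target_texts for w in tokenize(t)}
    if not target_vocab:
        raise ValueError("target_texts must contain at least one word")
    return len(source_vocab & target_vocab) / len(target_vocab)
```

Under these assumed definitions, a dataset-level $\lambda$ could be obtained by averaging `learning_difficulty` over all document-summary pairs, and $\gamma$ by passing the training-domain and target-domain texts to `cross_domain_overlap`.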


Paper Structure (32 sections, 7 equations, 9 figures, 6 tables)


Introduction
Related Work
Continual Pre-training
Alignment of Distributions
Adaptation Tuning
What and How does Word Influence Domain Adaptation?
Word Influence On Target Domain Dataset Learning Difficulty
Compression Ratio
Abstraction Level
Learning Difficulty Coefficient
Possible Impact Aspects Cross Different Domains Based on Words
Summarization Gain
Cross-domain Overlap
Word Count
How to influence?
...and 17 more sections

Figures (9)

Figure 1: Bloom-1.1B
Figure 2: Bloom-3B
Figure 3: Llama2-7B
Figure 5: Multi-Domain
Figure 6: Mixed-Domain
...and 4 more figures
