Measuring the Robustness of NLP Models to Domain Shifts

Nitay Calderon, Naveh Porat, Eyal Ben-David, Alexander Chapanin, Zorik Gekhman, Nadav Oved, Vitaly Shalumov, Roi Reichart

TL;DR

This work reframes domain robustness (DR) evaluation by introducing a Target Drop ($TD$) metric alongside the conventional Source Drop ($SD$) and by constructing a natural-domain DR benchmark spanning seven NLP tasks. A large-scale study across 21 models and over 14,000 domain shifts reveals that both fine-tuned models and few-shot LLMs experience cross-domain degradation, but few-shot LLMs often exhibit stronger cross-domain robustness. Importantly, a large $SD$ can stem from shifts to harder target domains rather than genuine DR challenges, underscoring the value of the $TD$ metric. The findings advocate evaluating multiple domain shifts and both $SD$ and $TD$ to avoid biased conclusions about DR, with implications for model selection, evaluation practices, and domain adaptation research. The paper also analyzes the relationship between domain divergence and DR, showing that $TD$ provides a more reliable estimator of average DR, and discusses practical guidance for practitioners and researchers building robust NLP systems.

Abstract

Existing research on Domain Robustness (DR) suffers from disparate setups, limited task variety, and scarce research on recent capabilities such as in-context learning. Furthermore, the common practice of measuring DR might not be fully accurate. Current research focuses on challenge sets and relies solely on the Source Drop (SD): Using the source in-domain performance as a reference point for degradation. However, we argue that the Target Drop (TD), which measures degradation from the target in-domain performance, should be used as a complementary point of view. To address these issues, we first curated a DR benchmark comprised of 7 diverse NLP tasks, which enabled us to measure both the SD and the TD. We then conducted a comprehensive large-scale DR study involving over 14,000 domain shifts across 21 fine-tuned models and few-shot LLMs. We found that both model types suffer from drops upon domain shifts. While fine-tuned models excel in-domain, few-shot LLMs often surpass them cross-domain, showing better robustness. In addition, we found that a large SD can often be explained by shifting to a harder domain rather than by a genuine DR challenge, and this highlights the importance of TD as a complementary metric. We hope our study will shed light on the current DR state of NLP models and promote improved evaluation practices toward more robust models.
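
The two drops described above can be illustrated with a minimal Python sketch. It assumes that the SD is the source in-domain score minus the cross-domain score and that the TD is the target in-domain score minus the cross-domain score. The `ShiftScores` class, the `scenario` helper, and the rule that maps the signs of the two drops to scenario names are hypothetical illustrations, not the authors' code, and the target in-domain scores in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftScores:
    """Scores for one domain shift S -> T (hypothetical container)."""
    ss: float  # source in-domain: trained on S, tested on S
    tt: float  # target in-domain: trained on T, tested on T
    st: float  # cross-domain: trained on S, tested on T

    @property
    def source_drop(self) -> float:
        # Degradation relative to the source in-domain performance.
        return self.ss - self.st

    @property
    def target_drop(self) -> float:
        # Degradation relative to the target in-domain performance.
        return self.tt - self.st


def scenario(scores: ShiftScores) -> str:
    """Name the shift by the signs of the two drops (assumed mapping)."""
    sd_positive = scores.source_drop > 0
    td_positive = scores.target_drop > 0
    if sd_positive and td_positive:
        return "Classic"
    if sd_positive:
        return "Observed"
    if td_positive:
        return "Unobserved"
    return "No Challenge"


if __name__ == "__main__":
    # A 15-point source drop, but the target domain is simply harder (TT = 60).
    harder_target = ShiftScores(ss=85, tt=60, st=70)
    print(harder_target.source_drop, harder_target.target_drop, scenario(harder_target))

    # A 15-point gain over the source, yet still below the target in-domain score.
    hidden_drop = ShiftScores(ss=70, tt=90, st=85)
    print(hidden_drop.source_drop, hidden_drop.target_drop, scenario(hidden_drop))
```

In the first usage line the SD alone would suggest a robustness problem, while the negative TD suggests the target domain is harder. In the second, the SD alone would hide a drop that the TD exposes.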

Paper Structure (34 sections, 1 theorem, 10 equations, 12 figures, 11 tables)

Key Result

Theorem 1

Let $(S,T)$ be different source and target domains sampled independently from the domain space, and let $(\textcolor{ss_orange}{\mathrm{SS}}, \textcolor{tt_blue}{\mathrm{TT}}, \textcolor{st_green}{\mathrm{ST}})$ be RVs of their performances. The following are equivalent:

Figures (12)

  • Figure 1: Illustration of the four domain shift scenarios. In the Classic and Observed scenarios, we observe a 15-point drop between the Source In-domain Performance ($\textcolor{ss_orange}{\mathrm{SS}}$) and the Cross-domain Performance ($\textcolor{st_green}{\mathrm{ST}}$). Conversely, in the Unobserved and No Challenge scenarios, $\textcolor{ss_orange}{\mathrm{SS}}=70$ and $\textcolor{st_green}{\mathrm{ST}}=85$, meaning the model gains 15 points upon domain shift. We would typically conclude that there is a DR challenge only in the first two scenarios. However, we argue that this commonly adopted perspective is inaccurate since it overlooks the Target In-domain Performance ($\textcolor{tt_blue}{\mathrm{TT}}$). Our work provides a fresh perspective by considering both degradation metrics: The Source Drop ($\textcolor{sd_orange}{\mathrm{SD}}$) and the Target Drop ($\textcolor{td_blue}{\mathrm{TD}}$).
  • Figure 2: Average $\textcolor{sd_orange}{\mathrm{SD}}$ (orange lines) and Average $\textcolor{td_blue}{\mathrm{TD}}$ (blue lines) as a function of challenging domain shifts. Specifically, we sort the domain shifts by their In-domain Difference ($\mathrm{IDD}$), and as we move to the right on the x-axis, we incrementally include an additional domain shift in the average drop calculation. Consequently, the leftmost point represents the shift with the largest IDD, while the rightmost point encompasses all shifts. The figure compares the best fine-tuned model (see caption of Table \ref{tab:main_fs_results}, solid lines) against GPT4 (dashed lines). This figure illustrates three key findings: (1) The $\textcolor{sd_orange}{\mathrm{SD}}$ is larger than the $\textcolor{td_blue}{\mathrm{TD}}$, and when including all shifts their averages are equal; (2) Generally, fine-tuned models exhibit larger drops; (3) Examining only challenging shifts and focusing solely on the $\textcolor{sd_orange}{\mathrm{SD}}$ obscures the true DR state. Incorporating the $\textcolor{td_blue}{\mathrm{TD}}$ can compensate for this and provide a clearer understanding.
  • Figure 3: The proportion of each domain shift scenario (see §\ref{sub:scenarios}) for fine-tuned (top chart) and few-shot models (bottom). For each task, the proportion is measured over all the models and domain shifts. More details in §\ref{sub:scenarios_stats}.
  • Figure 4: Fine-tuning performance for the seven tasks of different models with varying sizes. The plots present the F1 and BertScore scores of the average in-domain (black line) and cross-domain (green line) performance. In addition, the highest in-domain score (orange line) and the lowest cross-domain score (blue line) are displayed.
  • Figure 5: Fine-tuning drops of DeBERTa and T5 families. The plots present: The Average Drop (green bars); The Worst $\textcolor{sd_orange}{\mathrm{SD}}$ (orange bars); and the Worst $\textcolor{td_blue}{\mathrm{TD}}$ (blue bars). The lines on the bars present the Average Worst $\textcolor{sd_orange}{\mathrm{SD}}$ and $\textcolor{td_blue}{\mathrm{TD}}$, i.e., for each source domain we first find the worst drop and then take the average over all source domains.
  • ...and 7 more figures
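
The cumulative averaging described in the Figure 2 caption can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's plotting code: it assumes the IDD of a shift is the source in-domain score minus the target in-domain score, and the function name `cumulative_average_drops`, the three toy domains, and all scores are hypothetical.

```python
from itertools import permutations


def cumulative_average_drops(in_domain, cross):
    """Return [(avg SD, avg TD), ...] over the k shifts with the largest IDD.

    in_domain: dict mapping a domain to its in-domain score.
    cross: dict mapping (source, target) to the cross-domain score.
    """
    shifts = []
    for source, target in permutations(in_domain, 2):
        ss, tt, st = in_domain[source], in_domain[target], cross[(source, target)]
        # Assumed definitions: IDD = SS - TT, SD = SS - ST, TD = TT - ST.
        shifts.append((ss - tt, ss - st, tt - st))

    # Most "challenging" shifts (largest IDD) come first, as on the x-axis.
    shifts.sort(key=lambda shift: shift[0], reverse=True)

    curve = []
    sd_sum = td_sum = 0.0
    for k, (_, sd, td) in enumerate(shifts, start=1):
        sd_sum += sd
        td_sum += td
        curve.append((sd_sum / k, td_sum / k))
    return curve


# Toy scores for three hypothetical domains.
in_domain = {"A": 90.0, "B": 80.0, "C": 70.0}
cross = {
    ("A", "B"): 75.0, ("A", "C"): 65.0,
    ("B", "A"): 78.0, ("B", "C"): 68.0,
    ("C", "A"): 72.0, ("C", "B"): 66.0,
}
curve = cumulative_average_drops(in_domain, cross)
```

Under these definitions, SD minus TD equals the IDD for every shift, so the average SD sits above the average TD while only high-IDD shifts are included. Once every ordered pair of domains is included, each domain serves as source and as target equally often, the IDDs cancel, and the two averages coincide, which matches finding (1) in the caption.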

Theorems & Definitions (4)

  • Theorem 1
  • Remark 1
  • Remark 2
  • proof