Exploring Language Model Generalization in Low-Resource Extractive QA

Saptarshi Sengupta, Wenpeng Yin, Preslav Nakov, Shreya Ghosh, Suhang Wang

TL;DR

This work investigates zero-shot generalization of extractive QA to closed domains (e.g., medicine and law) using a broad set of architectures across five datasets. It jointly analyzes model factors (answer length, polysemy, architecture, and tokenization) and dataset factors (similarity to SQuAD and dataset perplexity) to explain cross-domain failures. Key findings show that scaling alone does not ensure transfer, and mismatches in answer-length distributions, domain sense discrimination, tokenization, and dataset similarity drive performance gaps; aspects like SentencePiece Unigram tokenization and whole-word masking can mitigate some of these gaps. The results offer actionable guidance for building and evaluating cross-domain EQA systems, highlighting directions for tokenization choices, prompt design, and leveraging dataset similarity as a predictor of transfer success.

Abstract

In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize to domains that require specific knowledge, such as medicine and law, in a zero-shot fashion without additional in-domain training? To this end, we devise a series of experiments to explain the performance gap empirically. Our findings suggest that: (a) LLMs struggle with the dataset demands of closed domains, such as retrieving long answer spans; (b) certain LLMs, despite showing strong overall performance, display weaknesses in meeting basic requirements, such as discriminating between domain-specific senses of words, which we link to pre-processing decisions; (c) scaling model parameters is not always effective for cross-domain generalization; and (d) closed-domain datasets are quantitatively very different from open-domain EQA datasets, and current LLMs struggle to deal with them. Our findings point out important directions for improving existing LLMs.

Paper Structure (28 sections, 3 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: We attempt to explain the performance drop when a model trained on an in-domain (ID) dataset (SQuAD; pink) is tested on ID data (SQuAD) vs. out-of-domain (OOD) data (blue).
  • Figure 2: Proposed Experiments. *We provide a detailed analysis of causal LLMs in the appendix on autoregressive models and discuss why they are suboptimal for EQA.
  • Figure 3: FDA Plot. Each bar represents FDA similarity between SQuAD and the corresponding OOD dataset.
  • Figure 4: Scatter plots with trend lines between model perplexity (PPL) and performance (F1). Pearson correlation between F1 and PPL (clockwise from top left): BERT: -0.17, RoBERTa: -0.48, Falcon: -0.77, Platypus: -0.9.
  • Figure 5: Answer length distribution for BiDAF and RoBERTa on SQuAD (top) and TechQA (bottom).
  • ...and 5 more figures
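
The Figure 4 caption reports Pearson correlations between perplexity (PPL) and F1. As a minimal sketch of how such a coefficient is computed, the snippet below implements the standard Pearson formula in plain Python. The function name `pearson` and the toy values are assumptions made for illustration only; they are not the paper's code or data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT results from the paper:
# one (perplexity, F1) pair per dataset for a single model.
toy_ppl = [10.0, 20.0, 30.0, 40.0]
toy_f1 = [80.0, 70.0, 65.0, 50.0]

r = pearson(toy_ppl, toy_f1)
print(f"Pearson r between PPL and F1 (toy data): {r:.2f}")
```

A negative coefficient, as in the caption, means that F1 tends to fall as perplexity rises; values closer to -1 indicate a stronger linear relationship.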

Theorems & Definitions (1)

  • Definition 1