Fine-tuning Strategies for Domain Specific Question Answering under Low Annotation Budget Constraints

Kunpeng Guo, Dennis Diefenbach, Antoine Gourru, Christophe Gravier

TL;DR

The paper addresses domain-specific extractive QA under low annotation budgets and demonstrates that the conventional sequential fine-tuning pipeline is sub-optimal. Through an exhaustive evaluation of 18 fine-tuning strategies (with/without MLM) across four domain datasets, it shows that mixing target-domain data with SQuAD during fine-tuning (merge-based approaches) yields robust gains without additional labeling, outperforming the baseline by $2.28\%$ to $6.48\%$ on average. Knowledge-Alignment Fine-tuning via MLM provides little to no reliable benefit in these low-budget settings, while the best-performing merge strategy (MWO) consistently improves performance, particularly at very small budgets where gains can be substantial (e.g., up to $12.5$ percentage points for KG-QA at $k=100$). The study provides practical guidance: for small budgets, use a carefully chosen merge-based strategy; for larger budgets, the benefit of exploring multiple strategies diminishes, suggesting a fixed robust approach suffices. The work also offers a complete experimental protocol and highlights the practical impact for QA practitioners facing annotation constraints.

Abstract

The progress introduced by pre-trained language models and their fine-tuning has resulted in significant improvements in most downstream NLP tasks. Unsupervised training of a language model followed by fine-tuning on the target task has become the standard QA fine-tuning procedure. In this work, we demonstrate that this strategy is sub-optimal for fine-tuning QA models, especially under a low QA annotation budget, which is a common setting in practice because of the cost of extractive QA labeling. We draw our conclusions from an exhaustive analysis of the performance of alternatives to the sequential fine-tuning strategy on different QA datasets. Based on these experiments, we observe that the best strategy for fine-tuning a QA model in low-budget settings is to take a pre-trained language model (PLM) and fine-tune it on a dataset composed of the target dataset and the SQuAD dataset. With zero extra annotation effort, the best strategy outperforms the standard strategy by 2.28% to 6.48%. Our experiments provide one of the first investigations into how best to fine-tune a QA system under a low budget and are therefore of the utmost practical interest to QA practitioners.
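
The merge idea described above, fine-tuning on one dataset composed of the target data and SQuAD rather than on each in sequence, can be illustrated with a minimal sketch. It only shows how such a merged training set might be assembled under an annotation budget. The function name `build_merged_training_set`, the `origin` field, and the toy data are assumptions made for illustration; they are not taken from the paper, and the sketch does not reproduce the paper's specific merge variants (such as MWO), which are not defined in this summary.

```python
import random
from typing import Dict, List

QAExample = Dict[str, str]  # hypothetical schema: question, context, answer, origin


def build_merged_training_set(
    target_pool: List[QAExample],
    source_examples: List[QAExample],
    budget: int,
    seed: int = 0,
) -> List[QAExample]:
    """Sample `budget` labeled target examples and mix them with the source set."""
    rng = random.Random(seed)
    # Simulate the annotation budget: only `budget` target examples are labeled.
    k = min(budget, len(target_pool))
    target_sample = rng.sample(target_pool, k)
    # Merge into one training set instead of two sequential fine-tuning stages.
    merged = list(source_examples) + target_sample
    rng.shuffle(merged)
    return merged


def make_toy_examples(n: int, origin: str) -> List[QAExample]:
    """Create placeholder QA examples (toy data, not real SQuAD or domain data)."""
    return [
        {"question": f"{origin}-q{i}", "context": f"{origin}-c{i}",
         "answer": f"{origin}-a{i}", "origin": origin}
        for i in range(n)
    ]


squad_like = make_toy_examples(1000, "source")   # stand-in for SQuAD
domain_pool = make_toy_examples(500, "target")   # stand-in for a domain dataset

merged = build_merged_training_set(domain_pool, squad_like, budget=100)
n_target = sum(1 for ex in merged if ex["origin"] == "target")
print(len(merged), n_target)
```

The resulting list would then be passed to an ordinary extractive QA fine-tuning loop. At a budget of 100 the target examples are a small fraction of the merged set; how the two sources should be balanced is a design choice this sketch leaves open.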

Paper Structure

This paper contains 16 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Mainstream methods for QA fine-tuning.
  • Figure 2: Relative performance gain after a $\times 16$ data collection procedure, evaluated at low ($K=100$) and high ($K=1{,}600$) budget sizes.