MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models
Mandy Guo, Yinfei Yang, Daniel Cer, Qinlan Shen, Noah Constant
TL;DR
MultiReQA introduces a cross-domain retrieval QA evaluation across eight public datasets, benchmarking BM25, a BERT dual encoder, and USE-QA on sentence-level answer retrieval. Across in-domain and out-of-domain splits, BM25 often remains competitive or superior, while neural models shine on low-overlap or short-context tasks when fine-tuned on in-domain data, illustrating the importance of domain-specific training. The study highlights that large-scale pretraining is crucial for neural models, though specialized QA pretraining is not strictly necessary, and that cross-domain transfer has nuanced effects depending on dataset characteristics. Overall, the work provides a comprehensive, scalable framework for evaluating retrieval QA across diverse domains and offers practical guidance for balancing traditional IR baselines with neural approaches in open-domain settings.
Abstract
Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus (Ahmad et al., 2019). This paper presents MultiReQA, a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets. We provide the first systematic retrieval-based evaluation over these datasets using two supervised neural models, based on fine-tuning BERT and USE-QA models respectively, as well as a surprisingly strong information retrieval baseline, BM25. Five of these tasks contain both training and test data, while three contain test data only. Performance on the five tasks with training data shows that while a general model covering all domains is achievable, the best performance is often obtained by training exclusively on in-domain data.
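To make the sentence-level retrieval setting concrete, the following is a minimal sketch of ranking candidate answer sentences for a question with the standard Okapi BM25 scoring function. It is an illustration only, not the configuration used in this paper: the `tokenize` and `bm25_rank` helpers are hypothetical names, the regex tokenizer is a simplification, and the values `k1=1.2` and `b=0.75` are commonly used defaults assumed here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase alphanumeric runs.
    return re.findall(r"\w+", text.lower())


def bm25_rank(question, sentences, k1=1.2, b=0.75):
    """Return candidate sentence indices sorted from best to worst BM25 score.

    k1 and b are common default values, assumed here for illustration.
    """
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    if n == 0:
        return []
    # Average candidate length; fall back to 1.0 if every candidate is empty.
    avgdl = sum(len(d) for d in docs) / n or 1.0

    # Document frequency: number of candidates containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = tokenize(question)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)

    # Python's sort is stable, so ties keep their original order.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    candidates = [
        "The Eiffel Tower is in Paris.",
        "BM25 is a ranking function used in information retrieval.",
        "Bananas are rich in potassium.",
    ]
    ranking = bm25_rank("Where is the Eiffel Tower?", candidates)
    print(candidates[ranking[0]])
```

In this sketch each candidate sentence is scored on its own text; an actual ReQA system would index every sentence in the corpus and could also use the surrounding context of each sentence when scoring.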
