NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned

Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Edouard Grave, Ikuya Yamada, Sonse Shimaoka, Masatoshi Suzuki, Shumpei Miyawaki, Shun Sato, Ryo Takahashi, Jun Suzuki, Martin Fajcik, Martin Docekal, Karel Ondrej, Pavel Smrz, Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, Wen-tau Yih

TL;DR

The paper analyzes the NeurIPS 2020 EfficientQA competition, which benchmarks open-domain QA under strict on-disk memory budgets. It details the competition design, data, evaluation metrics, and the diverse retrieval-reader architectures that achieved state-of-the-art results within budget. A key contribution is the demonstration that memory-constrained systems can significantly outperform baselines through targeted retrieval compression, model sharing, and corpus augmentation, while highlighting the gap between automatic exact-match metrics and human judgments on semantically correct answers. The work also introduces a human-correctness annotation scheme to better capture ambiguity and time-dependence of answers, and it discusses the implications for evaluation in future efficient QA challenges. Overall, the findings provide practical guidance for designing memory-efficient open-domain QA systems and for improving QA evaluation in realistic settings.

Abstract

We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing retrieval corpora and storing the parameters of learned models. In this report, we describe the motivation and organization of the competition, review the best submissions, and analyze system predictions to inform a discussion of evaluation for open-domain QA.

Paper Structure

This paper contains 55 sections, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Memory footprint of each system component.
  • Figure 2: (Left) Agreement between system predictions. (Right) Ensemble oracle accuracy, which considers a prediction correct if at least one of the system predictions is correct (based on "definitely correct" human evaluation).
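
The ensemble oracle accuracy named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes per-question, per-system correctness judgments are already available as booleans; the function name, the data layout, and the toy values below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def ensemble_oracle_accuracy(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions that at least one system answered correctly.

    correct[i][j] is True if system j's prediction for question i was
    judged correct (hypothetical layout, assumed for this sketch).
    """
    if not correct:
        return 0.0
    # A question counts as a hit if any system got it right.
    hits = sum(1 for row in correct if any(row))
    return hits / len(correct)


# Toy judgments for 4 questions and 3 systems (made-up values).
judgments = [
    [True, False, False],
    [False, False, False],
    [False, True, True],
    [False, False, True],
]
print(ensemble_oracle_accuracy(judgments))  # 0.75
```

By construction, the oracle score is never lower than the accuracy of any single system in the ensemble, so it acts as an upper bound on what a perfect selector over those systems could reach.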