DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs

Venktesh V, Deepali Prabhu, Avishek Anand

TL;DR

DEXTER introduces a unified benchmark and toolkit for open-domain complex QA, evaluating both heterogeneous retrieval and downstream LLM-based answer generation across seven datasets that span compositional, commonsense, ambiguity, and multimodal evidence. The study finds that lexical BM25 and late-interaction models like ColBERTv2 often outperform dense retrievers on complex queries, while large language models still struggle without retrieved context, especially with hybrid table-text evidence. Retrieval augmentation significantly boosts LLM reasoning in open-domain settings (as shown by Oracle-style gold-context experiments), yet gaps remain in handling hybrid data modalities and question ambiguity. The results highlight a substantial need for progress in retrieval quality and hybrid-evidence reasoning to unlock robust open-domain complex QA systems, and the authors provide a reusable toolkit and data to advance this research agenda.

Abstract

Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. The complexity of such questions can stem from compositional questions, hybrid evidence, or question ambiguity. While retrieval performance on classical QA tasks is well explored, the capabilities of retrieval models on heterogeneous complex retrieval tasks, especially in an open-domain setting, and their impact on downstream QA performance remain relatively unexplored. To address this, we propose a benchmark comprising diverse complex QA tasks and provide a toolkit to evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late-interaction models and, surprisingly, lexical models like BM25 perform well compared to other pre-trained dense retrieval models. In addition, since context-based reasoning is critical for solving complex QA tasks, we also evaluate the reasoning capabilities of LLMs and the impact of retrieval performance on those capabilities. Our experiments show that much progress remains to be made in retrieval for complex QA to improve downstream QA performance. Our software and related data can be accessed at https://github.com/VenkteshV/DEXTER
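
To make the lexical baseline mentioned above concrete, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration only and not the DEXTER toolkit's API: the function name `bm25_scores`, the toy corpus, whitespace tokenization, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    Hypothetical helper for illustration; not part of the DEXTER toolkit.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (assumed data), tokenized by whitespace.
corpus = [
    "the eiffel tower is in paris".split(),
    "paris is the capital of france".split(),
    "bm25 is a lexical ranking function".split(),
]
query = "capital of france".split()

scores = bm25_scores(query, corpus)
# Document indices sorted from most to least relevant.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

Because BM25 only rewards exact term overlap, documents that share no query terms score zero here, which is the behavior that dense retrievers aim to improve on through learned representations.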

Paper Structure (53 sections, 10 figures, 21 tables)

Figures (10)

  • Figure 1: An Overview of the DEXTER Benchmark and Toolkit
  • Figure 2: Effect of the number of retrieved documents using ColBERTv2 on QA performance (few-shot-cot, gpt-3.5-turbo)
  • Figure 3: Example of In-context learning for MusiqueQA through manual Few-shot-cot based prompting of LLMs (limited examples shown)
  • Figure 4: Example of In-context learning for 2WikiMultiHopQA through manual Few-shot-cot based prompting of LLMs (limited examples shown)
  • Figure 5: Prompt for FinQA
  • ...and 5 more figures