
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents

Corby Rosset, Ho-Lam Chung, Guanghui Qin, Ethan C. Chau, Zhuo Feng, Ahmed Awadallah, Jennifer Neville, Nikhil Rao

TL;DR

This paper introduces Researchy Questions, a large-scale dataset of non-factoid, decompositional, multi-perspective questions mined from real search logs to probe LLM web agents. A five-stage pipeline (mining, non-factoid filtering, decompositional filtering, deduplication, and GPT-4 quality screening) yields about 96k questions with clicked ClueWeb22 URLs, accompanied by two-level decomposition plans. Characterization shows these questions demand substantial information retrieval and multi-faceted reasoning, with engagement signals indicating real-world search effort. Evaluations reveal that decompositional answering strategies outperform direct answers, particularly for long-form questions, suggesting promising directions for agentic QA systems and retrieval-augmented workflows. The work provides a foundation for new evaluation metrics and further exploration of pivotal facts and sub-question quality in web-based QA.

Abstract

Existing question answering (QA) datasets are no longer challenging to the most powerful Large Language Models (LLMs). Traditional QA benchmarks like TriviaQA, NaturalQuestions, ELI5 and HotpotQA mainly study "known unknowns" with clear indications of both what information is missing and how to find it to answer the question. Hence, good performance on these benchmarks provides a false sense of security. A yet unmet need of the NLP community is a bank of non-factoid, multi-perspective questions involving a great deal of unclear information needs, i.e., "unknown unknowns". We claim we can find such questions in search engine logs, which is surprising because most question-intent queries are indeed factoid. We present Researchy Questions, a dataset of search engine queries tediously filtered to be non-factoid, "decompositional" and multi-perspective. We show that users spend a lot of "effort" on these questions in terms of signals like clicks and session length, and that they are also challenging for GPT-4. We also show that "slow thinking" answering techniques, like decomposition into sub-questions, show benefit over answering directly. We release ~100k Researchy Questions, along with the ClueWeb22 URLs that were clicked.

Paper Structure (17 sections, 11 figures, 10 tables)


Figures (11)

  • Figure 1: Qualitative comparison of how Researchy Questions differs from other Question Answering datasets. Researchy Questions involve a greater deal of complexity and "unknown unknowns" than other QA datasets.
  • Figure 2: (Right) Histogram of the number of documents clicked per question for Researchy Questions, which is much higher than for general web search queries. (Left) Number of queries associated with each document. The fact that few queries are associated with each document validates the effectiveness of our query-deduplication procedure.
  • Figure 3: (Left) Non-factoid scores of the 15.7M QnA Queries. The roughly 1M queries whose score exceeded the threshold +0.75 were sent to the Decompositional classifier. Note that because this was a binary classifier, 89% of the non-factoid scores were less than -0.75; these are cut off from the left-hand histogram to make it easier to visualize. (Right) The Decompositional classifier's scores of the roughly 1M Non-factoid queries. Around 146k queries exceeding the 0.6 threshold line are considered both Non-factoid and Decompositional, and were then de-duplicated to arrive at the final Researchy Questions dataset of around 100k.
  • Figure 4: Prompt given to text-davinci-003 to collect labels of whether a question is non-factoid. The current question is substituted at the end. Labels on a 1-10 scale were binarized to train the non-factoid classifier.
  • Figure 5: Prompt given to gpt-35-turbo to collect labels of how appropriate a question is for "decomposition" into sub-questions. These labels were used to train the Decompositional classifier.
  • ...and 6 more figures
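
The threshold cascade described in the Figure 3 caption can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the two thresholds (+0.75 and 0.6) come from the caption, but the `ScoredQuery` record, its field names, and the `normalize` helper are hypothetical. In particular, the lowercase-and-collapse-whitespace de-duplication is a crude stand-in, since the de-duplication procedure itself is not described here.

```python
from dataclasses import dataclass

# Thresholds as stated in the Figure 3 caption.
NON_FACTOID_THRESHOLD = 0.75
DECOMPOSITIONAL_THRESHOLD = 0.6


@dataclass
class ScoredQuery:
    """Hypothetical record: a query plus the two classifier scores."""
    text: str
    non_factoid_score: float
    decompositional_score: float


def normalize(text: str) -> str:
    """Assumed stand-in for de-duplication: lowercase, collapse whitespace."""
    return " ".join(text.lower().split())


def filter_queries(queries):
    """Keep queries that exceed both thresholds, dropping near-identical text."""
    seen = set()
    kept = []
    for q in queries:
        # Stage 1: the non-factoid score must exceed +0.75.
        if q.non_factoid_score <= NON_FACTOID_THRESHOLD:
            continue
        # Stage 2: the decompositional score must exceed 0.6.
        if q.decompositional_score <= DECOMPOSITIONAL_THRESHOLD:
            continue
        # Stage 3: skip queries whose normalized text was already kept.
        key = normalize(q.text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(q)
    return kept


if __name__ == "__main__":
    sample = [
        ScoredQuery("how tall is mount everest", -0.9, 0.1),
        ScoredQuery("Why did the Roman Empire fall", 0.9, 0.8),
        ScoredQuery("why did the  roman empire fall", 0.8, 0.7),
        ScoredQuery("what is photosynthesis", 0.8, 0.2),
    ]
    for q in filter_queries(sample):
        print(q.text)
```

Running the stages in this order mirrors the caption, where only queries that pass the non-factoid threshold are sent to the Decompositional classifier, and de-duplication is applied last to the survivors.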