TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions

Jamshid Mozafari, Anubhav Jangra, Adam Jatowt

TL;DR

TriviaHG tackles the risk that direct answers from LLMs can erode human reasoning by proposing hint-based guidance for factoid questions. It introduces a two-module pipeline to construct a large-scale TriviaHG dataset (16,645 questions, 160,230 hints) and pairs it with automatic evaluation metrics for convergence (HICOS) and familiarity (HIFAS). Empirical results show hints can effectively aid users in finding answers, with performance depending on question difficulty, and demonstrate strong alignment between automatic metrics and human judgments. The work enables targeted fine-tuning of generative models and has practical implications for retrieval-augmented generation, query expansion, and educational tooling by providing high-signal hints rather than direct solutions.

Abstract

Nowadays, individuals tend to engage in dialogues with Large Language Models, seeking answers to their questions. In times when such answers are readily accessible to anyone, stimulating and preserving human cognitive abilities, and ensuring that humans maintain good reasoning skills, become crucial. This study addresses such needs by proposing hints (instead of final answers or before giving answers) as a viable solution. We introduce a framework for automatic hint generation for factoid questions and employ it to construct TriviaHG, a novel large-scale dataset featuring 160,230 hints corresponding to 16,645 questions from the TriviaQA dataset. Additionally, we present an automatic evaluation method that measures the Convergence and Familiarity quality attributes of hints. To evaluate the TriviaHG dataset and the proposed evaluation method, we enlisted 10 individuals to annotate 2,791 hints and tasked 6 humans with answering questions using the provided hints. The effectiveness of hints varied, with success rates of 96%, 78%, and 36% for questions with easy, medium, and hard answers, respectively. Moreover, the proposed automatic evaluation methods showed a robust correlation with annotators' results. In conclusion, the findings highlight three key insights: the facilitative role of hints in resolving unknown questions, the dependence of hint quality on answer difficulty, and the feasibility of employing automatic evaluation methods for hint assessment.

Paper Structure (31 sections, 1 equation, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Example hints for two sample questions in TriviaHG with their computed convergence quality (HICOS) and familiarity (HIFAS) provided on a scale of 1 to 5 (1 is the lowest and 5 is the highest quality).
  • Figure 2: Dataset generation framework: Arrows represent the output quantity based on the number of questions, and callouts illustrate the statistics of the training, validation, and test sets. Qs and Hs denote the number of questions and hints, respectively.
  • Figure 3: The hint generation system begins by submitting the question to Copilot, which produces a snippet serving as the answer. We then assess the correctness of the provided answer. If the answer is correct, we prompt Copilot to generate 10 hints. The numbers in the brackets denote the source pages for both the answer and the hints.
  • Figure 4: Convergence Evaluator: The process begins by directing the question to the Candidate Generator stage, which generates up to twenty candidate answers. Subsequently, each generated candidate answer and the hint undergo an evaluation to determine the validity of the hint for the respective candidate answer. Finally, the results are conveyed to the Scoring stage for computation of the HICOS.
  • Figure 5: Familiarity Evaluator: The Named Entity Recognizer identifies named entities from the hint. Subsequently, the number of views for the Wikipedia page associated with each entity is extracted. Finally, the view count for each entity undergoes a normalization process.
  • ...and 4 more figures
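
The evaluator stages described in the captions of Figures 4 and 5 can be illustrated with a minimal Python sketch. The captions do not give the scoring or normalization formulas, so everything below is an assumption made for illustration: the function names, the `is_valid` callback (standing in for the hint-versus-candidate validity check), the "fraction of wrong candidates eliminated" score, and the log-scaled view-count normalization are all hypothetical and are not the paper's HICOS or HIFAS definitions. The sketch returns values in [0, 1] and does not attempt the mapping to the 1 to 5 scale mentioned in Figure 1.

```python
from math import log10
from typing import Callable, Dict, List


def convergence_score(
    hint: str,
    correct_answer: str,
    candidates: List[str],
    is_valid: Callable[[str, str], bool],
) -> float:
    """Hypothetical convergence score in [0, 1] (not the paper's formula).

    Returns 0.0 if the hint does not fit the correct answer; otherwise
    returns the fraction of wrong candidates that the hint rules out.
    """
    if not is_valid(hint, correct_answer):
        return 0.0
    wrong = [c for c in candidates if c != correct_answer]
    if not wrong:
        return 1.0
    eliminated = sum(1 for c in wrong if not is_valid(hint, c))
    return eliminated / len(wrong)


def familiarity_score(entity_views: Dict[str, int], max_views: int) -> float:
    """Hypothetical familiarity score in [0, 1] (not the paper's formula).

    entity_views maps each named entity found in the hint to a page-view
    count; max_views is an assumed reference ceiling used for normalization.
    Each count is log-scaled against the ceiling, then the values are averaged.
    """
    if not entity_views or max_views <= 0:
        return 0.0
    ceiling = log10(1 + max_views)
    scaled = [min(1.0, log10(1 + v) / ceiling) for v in entity_views.values()]
    return sum(scaled) / len(scaled)


if __name__ == "__main__":
    # Toy validity table standing in for the hint-vs-candidate check.
    facts = {
        ("It lies on the Seine.", "Paris"): True,
        ("It lies on the Seine.", "Rouen"): True,
    }

    def lookup(hint: str, candidate: str) -> bool:
        return facts.get((hint, candidate), False)

    pool = ["Paris", "Rouen", "Lyon", "Nice", "Lille"]
    print(convergence_score("It lies on the Seine.", "Paris", pool, lookup))
    print(familiarity_score({"Seine": 9}, max_views=99))
```

In this toy run the hint fits the correct answer and one of the four wrong candidates, so it eliminates three of four. A hint that fits every candidate would score 0.0 under this assumed scheme, which matches the intuition that such a hint does not help the user converge on the answer.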