WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation

Jamshid Mozafari, Florian Gerhold, Adam Jatowt

TL;DR

This paper introduces WikiHint, the first manually verified dataset for hint generation and ranking, derived from Wikipedia with 5,000 hints for 1,000 questions. It details a full pipeline for question sampling, hint generation via crowdsourcing, and rigorous human verification, along with a train/test split designed for robust evaluation. The authors also propose HintRank, a lightweight, encoder-based method to rank hints without requiring heavy LLM evaluation, and they evaluate both hint generation and ranking across answer-aware and answer-agnostic settings. Key findings include that hints improve question answering, shorter hints tend to be more helpful, and encoder-based models can outperform decoders in hint ranking, with finetuning further enhancing performance. These contributions advance human-centered QA augmentation and offer scalable, efficient tools for hint-based assistance and educational applications.

Abstract

The use of Large Language Models (LLMs) has increased significantly, with users frequently asking chatbots questions. At a time when information is readily accessible, it is crucial to stimulate and preserve human cognitive abilities and maintain strong reasoning skills. This paper addresses this challenge by promoting the use of hints as an alternative or a supplement to direct answers. We first introduce a manually constructed hint dataset, WikiHint, which is based on Wikipedia and includes 5,000 hints created for 1,000 questions. We then finetune open-source LLMs for hint generation in answer-aware and answer-agnostic contexts. We assess the effectiveness of the hints with human participants who answer questions with and without the aid of hints. Additionally, we introduce a lightweight evaluation method, HintRank, to evaluate and rank hints in both answer-aware and answer-agnostic settings. Our findings show that (a) the dataset helps generate more effective hints, (b) including answer information along with questions generally improves the quality of generated hints, and (c) encoder-based models perform better than decoder-based models in hint ranking.

Paper Structure

This paper contains 24 sections, 12 figures, and 10 tables.

Figures (12)

  • Figure 1: Hints for a question from the WikiHint dataset, with their corresponding rankings (1 being the highest and 5 the lowest), which let users find the answer. Note that the opposite arrangement of the hints would make the answer-finding task easiest when hints are read from rank 1 to 5.
  • Figure 2: Pipeline of WikiHint dataset generation. The numbers in the arrows indicate the counts of output questions. Qs and Hs denote the number of questions and hints, respectively.
  • Figure 3: The MTurk Worker interface for generating and ranking hints. ${question} represents the question presented to the worker, ${answer} is the corresponding answer, and ${link} provides the link to the relevant Wikipedia page.
  • Figure 4: The instructions for the hint generation and ranking tasks on the Amazon MTurk platform.
  • Figure 5: The HintRank method. First, the inputs are concatenated, and special tokens are added to format them for the BERT model. Finally, the BERT model determines which hint is more helpful.
  • ...and 7 more figures
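
The HintRank description in the Figure 5 caption (concatenate the inputs, add special tokens, let a BERT model decide which hint is more helpful) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the field order, the function names `format_pair`, `toy_prefers_first`, and `rank_hints`, and the aggregation of pairwise decisions by win count. The comparator is a toy stand-in that simply prefers the shorter hint; in practice it would be replaced by a finetuned encoder classifier.

```python
from itertools import combinations
from typing import Callable, List, Optional

# BERT-style special tokens; a real tokenizer would insert these itself.
CLS, SEP = "[CLS]", "[SEP]"


def format_pair(question: str, hint_a: str, hint_b: str,
                answer: Optional[str] = None) -> str:
    """Concatenate the inputs into one BERT-style sequence.

    The answer is included only in the answer-aware setting.
    The field order is an assumption, not the paper's specification.
    """
    parts = [question]
    if answer is not None:
        parts.append(answer)
    parts.extend([hint_a, hint_b])
    return f"{CLS} " + f" {SEP} ".join(parts) + f" {SEP}"


def toy_prefers_first(text: str) -> bool:
    """Hypothetical stand-in for the pairwise classifier.

    Returns True if the first hint is judged more helpful. Here the
    shorter hint wins, which is only a placeholder heuristic.
    """
    fields = [f.strip() for f in text.replace(CLS, "").split(SEP) if f.strip()]
    hint_a, hint_b = fields[-2], fields[-1]
    return len(hint_a) <= len(hint_b)


def rank_hints(question: str, hints: List[str],
               prefers_first: Callable[[str], bool] = toy_prefers_first,
               answer: Optional[str] = None) -> List[str]:
    """Rank hints by the number of pairwise comparisons each one wins.

    Win counting is an assumed aggregation scheme for this sketch.
    """
    wins = {i: 0 for i in range(len(hints))}
    for i, j in combinations(range(len(hints)), 2):
        text = format_pair(question, hints[i], hints[j], answer)
        winner = i if prefers_first(text) else j
        wins[winner] += 1
    # sorted() is stable, so ties keep their original order.
    order = sorted(wins, key=lambda i: -wins[i])
    return [hints[i] for i in order]


if __name__ == "__main__":
    hints = [
        "It is a capital city in Europe.",
        "Eiffel Tower.",
        "The city lies on the Seine river.",
    ]
    for rank, hint in enumerate(rank_hints("Which city is it?", hints), start=1):
        print(rank, hint)
```

Passing `answer="Paris"` to `rank_hints` switches the sketch to the answer-aware input format; omitting it gives the answer-agnostic one.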