WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation
Jamshid Mozafari, Florian Gerhold, Adam Jatowt
TL;DR
This paper introduces WikiHint, the first manually verified dataset for hint generation and ranking, derived from Wikipedia with 5,000 hints for 1,000 questions. It details a full pipeline for question sampling, hint generation via crowdsourcing, and human verification, along with a train/test split designed for robust evaluation. The authors also propose HintRank, a lightweight, encoder-based method that ranks hints without requiring heavy LLM evaluation, and they evaluate both hint generation and hint ranking in answer-aware and answer-agnostic settings. Key findings are that hints improve question answering, that shorter hints tend to be more helpful, and that encoder-based models can outperform decoder-based models in hint ranking, with finetuning further improving performance. These contributions advance human-centered QA augmentation and offer scalable, efficient tools for hint-based assistance and educational applications.
Abstract
The use of Large Language Models (LLMs) has increased significantly, with users frequently asking chatbots questions. At a time when information is readily accessible, it is crucial to stimulate and preserve human cognitive abilities and maintain strong reasoning skills. This paper addresses this challenge by promoting the use of hints as an alternative or a supplement to direct answers. We first introduce a manually constructed hint dataset, WikiHint, which is based on Wikipedia and includes 5,000 hints created for 1,000 questions. We then finetune open-source LLMs for hint generation in answer-aware and answer-agnostic contexts. We assess the effectiveness of the hints with human participants who answer questions with and without the aid of hints. Additionally, we introduce a lightweight evaluation method, HintRank, to evaluate and rank hints in both answer-aware and answer-agnostic settings. Our findings show that (a) the dataset helps generate more effective hints, (b) including answer information along with questions generally improves the quality of generated hints, and (c) encoder-based models perform better than decoder-based models in hint ranking.
