
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

TL;DR

KNIGHT presents a knowledge-graph-driven framework for low-cost, topic-specific MCQ dataset generation with adaptive difficulty. By caching a topic-focused KG and using retrieval-grounded, multi-hop prompts plus a rigorous LLM-based validation suite, it achieves high-quality ML evaluation content with strong subject relevance and controlled hardness. Across biology, history, and mathematics, KNIGHT demonstrates reduced hallucinations, competitive distractors, and model rankings aligned with established benchmarks, while enabling rapid, scalable dataset refreshes. The approach offers practical benefits for RAG evaluation, curriculum design, and benchmark construction, particularly where domain-specific, multi-hop reasoning is essential.

Abstract

Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate questions at instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.

Paper Structure

This paper contains 73 sections, 15 equations, 5 figures, 8 tables, and 2 algorithms.

Figures (5)

  • Figure 1: KNIGHT high-level pipeline. Given a prompt/topic and depth, KNIGHT retrieves evidence, builds a focused KG, generates MCQs, and filters them to produce the final dataset.
  • Figure 2: KNIGHT architecture. (Left) A topic/prompt-driven RAG pipeline retrieves evidence, extracts triples, and curates a compact KG under depth budget $d_{\max}$; (Right) multi-hop paths are sampled to generate questions/distractors and validated for evidence-grounded answerability to form the final MCQA dataset.
  • Figure 3: Distribution of question lengths for each dataset (histograms), demonstrating an approximately normal shape.
  • Figure 4: Entropy distributions by topic and difficulty level visualized using boxen plots with swarm overlays. Difficulty level 3 datasets consistently show higher entropy and wider distributions, reflecting greater model uncertainty.
  • Figure 5: Knowledge‐graph traversal from Hafez to Shiraz, 7th century, Iran, and >90 million, generating MCQ templates of increasing difficulty. The purple path denotes a 1‐hop (Level 1) question.
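
The multi-hop traversal described in the Figure 2 and Figure 5 captions can be illustrated with a minimal sketch. It is not the authors' implementation. The node names are taken from the Figure 5 caption, but the edge labels (`born_in`, `dates_to`, `located_in`, `population`), the graph topology, the helper names `paths_with_hops` and `to_template`, and the rule "difficulty level = hop count" are all assumptions made for illustration; the caption only confirms that a 1-hop path corresponds to Level 1.

```python
from collections import deque

# Toy graph. Node names come from the Figure 5 caption; the edge labels
# and the topology are illustrative placeholders, not taken from the paper.
TOY_KG = {
    "Hafez": [("born_in", "Shiraz")],
    "Shiraz": [("dates_to", "7th century"), ("located_in", "Iran")],
    "Iran": [("population", ">90 million")],
}


def paths_with_hops(kg, start, hops):
    """Return every simple path from `start` that uses exactly `hops` edges.

    Each result is a (nodes, relations) pair, found by breadth-first expansion.
    """
    results = []
    queue = deque([(start, [start], [])])
    while queue:
        node, nodes, rels = queue.popleft()
        if len(rels) == hops:
            results.append((nodes, rels))
            continue
        for rel, nxt in kg.get(node, []):
            if nxt not in nodes:  # skip cycles so paths stay simple
                queue.append((nxt, nodes + [nxt], rels + [rel]))
    return results


def to_template(nodes, rels):
    """Turn a path into a question template (assumed: level == hop count)."""
    chain = " -> ".join(rels)
    return {
        "level": len(rels),
        "question": f"Starting from {nodes[0]}, follow [{chain}]: "
                    "which entity is reached?",
        "answer": nodes[-1],
    }


if __name__ == "__main__":
    for hops in (1, 2, 3):
        for nodes, rels in paths_with_hops(TOY_KG, "Hafez", hops):
            print(to_template(nodes, rels))
```

Because the graph is built once and only read during traversal, generating questions at a different hop count reuses the same structure, which is the "cheap read over the graph" property the abstract describes. A real pipeline would replace the string template with an LLM prompt and add distractor generation and validation, which this sketch omits.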