Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning

Kimia Noorbakhsh, Joseph Chandler, Pantea Karimi, Mohammad Alizadeh, Hari Balakrishnan

TL;DR

Savaal introduces a scalable, domain-independent pipeline for generating high-quality, concept-driven multiple-choice questions from long documents. By extracting main ideas, retrieving targeted passages with ColBERT, and guiding an LLM to generate questions and distractors, the method achieves deeper questioning than direct prompting, especially on dissertations, while maintaining cost efficiency at scale. Evaluations with human experts show substantial improvements in depth and usability over baselines, though AI judges exhibit misalignment with human judgments, underscoring the challenge of automated evaluation. The work points to future enhancements in adaptive difficulty, human feedback integration, and broader domain validation to maximize learning impact across diverse materials.

Abstract

Assessing and enhancing human learning through question-answering is vital, yet automating this process remains challenging. While large language models (LLMs) excel at summarization and query responses, their ability to generate meaningful questions for learners is underexplored. We propose Savaal, a scalable question-generation system with three objectives: (i) scalability, enabling question generation from hundreds of pages of text, (ii) depth of understanding, producing questions beyond factual recall to test conceptual reasoning, and (iii) domain-independence, automatically generating questions across diverse knowledge areas. Instead of providing an LLM with large documents as context, Savaal improves results with a three-stage processing pipeline. Our evaluation with 76 human experts on 71 papers and PhD dissertations shows that Savaal generates questions that better test depth of understanding by 6.5X for dissertations and 1.5X for papers compared to a direct-prompting LLM baseline. Notably, as document length increases, Savaal's advantages in higher question quality and lower cost become more pronounced.

Paper Structure

This paper contains 38 sections, 30 figures, 3 tables.

Figures (30)

  • Figure 1: Savaal's Pipeline. Savaal extracts main ideas from sections of the document in parallel, combines them into a succinct list, and ranks them in order of importance. Next, Savaal fetches relevant passages from the document using a vector-based retrieval model. Finally, given a main idea and fetched passages, Savaal generates questions.
  • Figure 2: Summary of human evaluation: The charts show the percentage and standard error of respondents who Disagree or Somewhat Disagree with questions on understanding, choice quality, and usability. Lower values indicate better performance.
  • Figure 3: Expert preferences for 21 PhD dissertations. Each point shows the number of Agrees or Somewhat Agrees in a 10-question quiz for each of Savaal and Direct. The majority of experts prefer Savaal to Direct on depth of understanding, quality of choices, and usability on long documents (experts above $y=x$ prefer Savaal).
  • Figure 4: Preferences of 55 human experts on short conference papers. Each point shows the number of Agrees in a 10-question quiz for Savaal and Direct respectively. More experts prefer Savaal to Direct on depth of understanding. Experts show no clear preference between Savaal and Direct on quality of choices or usability on short documents (experts above $y=x$ prefer Savaal).
  • Figure 5: Score distribution for 0 questions from dissertations: GPT-4o as a judge does not align with humans for assessing the metrics.
  • ...and 25 more figures