PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models

Shivam Sharma, Riya Naik, Tejas Gawas, Heramb Patil, Kunal Korgaonkar

TL;DR

The paper addresses the challenge of building curriculum-aligned QA for education by introducing the NCERT-QA dataset and the PustakAI framework. It employs a retrieval-augmented prompting pipeline with varied strategies (including meta-prompting and one-shot prompts) to test model performance across English and Science for grades 6–8. Key findings show that contextual grounding is essential for accurate, faithful answers, with larger models and meta-prompting delivering the best balance of accuracy and efficiency. The work offers a practical path toward deploying curriculum-aware educational AI in resource-constrained schools and sets the stage for expanding to additional grades and subjects.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI" (Pustak means "book" in many Indian languages) for the design and evaluation of a novel question-answering dataset "NCERT-QA" aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.

Paper Structure

This paper contains 14 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: (a) NCERT-QA dataset curation process. NCERT textbooks are parsed to extract chapters as context, along with their respective questions. Answers are retrieved from various authentic public online sources and aligned by chapter and question indices; these answers serve as ground truth. The resulting QA dataset is structured as a collection of context-question-answer tuples. (b) LLM prompting and evaluation pipeline. The LLM is presented with a chapter and its corresponding question using a variety of prompting strategies. The model then generates a response to the question, which is compared against the ground truth using various evaluation metrics.
  • Figure 2: Subject-wise analysis of the NCERT-QA dataset showing question and answer length distributions and top named entities, highlighting differences in complexity and focus between English and Science.
  • Figure 3: Performance–efficiency trade-off: (a,b) Llama-4-Scout matches DeepSeek-70B performance at far lower inference time. (c) Average runtime per prompt type shows meta and meta-one-shot with reduced latency; Llama models and Gemma3-1B are fastest, DeepSeek-70B slowest.
  • Figure 4: Performance of models on contextual vs. non-contextual prompts. Faithfulness metric is absent for non-contextual prompts since faithfulness measures the match between the provided context and the LLM’s generated answer, which cannot be computed without context.
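
The captions for Figures 1 and 4 describe context-question-answer tuples and a comparison of contextual and non-contextual prompts. A minimal Python sketch of that idea follows. It is illustrative only: the names `QAExample`, `build_prompt`, and `token_f1` are hypothetical, the prompt template is an assumption rather than the paper's actual wording, and the token-overlap F1 score is a stand-in, since the summary above does not say which evaluation metrics were used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QAExample:
    """One context-question-answer tuple (hypothetical structure)."""
    context: str   # chapter text parsed from a textbook
    question: str  # question belonging to that chapter
    answer: str    # ground-truth answer aligned to the question


def build_prompt(example: QAExample, use_context: bool = True) -> str:
    """Build a contextual or non-contextual prompt.

    The template is an assumption made for illustration only.
    """
    parts = []
    if use_context:
        parts.append(f"Context:\n{example.context}")
    parts.append(f"Question: {example.question}")
    parts.append("Answer:")
    return "\n\n".join(parts)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and the ground truth.

    A stand-in metric, not necessarily one used in the paper.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


example = QAExample(
    context="Plants make their food by photosynthesis.",
    question="How do plants make their food?",
    answer="By photosynthesis",
)
contextual_prompt = build_prompt(example, use_context=True)
non_contextual_prompt = build_prompt(example, use_context=False)
```

Either prompt string would be sent to a model, and the generated answer could then be scored against `example.answer` with `token_f1`. A faithfulness-style score, which compares the answer with the provided context, would apply only to the contextual variant, as the Figure 4 caption notes.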