EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants

Meenakshi Mittal, Rishi Khare, Mihran Miroyan, Chancharik Mitra, Narges Norouzi

TL;DR

EduMod-LLM presents a modular function-calling framework for educational QA that isolates function calling, retrieval, and generation to enable fine-grained analysis on real student questions. It introduces an LLM-as-a-Judge module aligned with TA standards to automate pedagogical-quality evaluation at scale. The work demonstrates that structure-aware retrieval and multihop function calling substantially improve retrieval relevance and response quality, with GPT-4.1 delivering strong generation performance. Overall, modular design enhances transparency, adaptability across courses, and pedagogical alignment in educational AI assistants.

Abstract

With the growing use of Large Language Model (LLM)-based Question-Answering (QA) systems in education, it is critical to evaluate their performance across individual pipeline components. In this work, we introduce EduMod-LLM, a modular function-calling LLM pipeline, and present a comprehensive evaluation along three key axes: function-calling strategies, retrieval methods, and generative language models. Our framework enables fine-grained analysis by isolating and assessing each component. We benchmark function-calling performance across LLMs, compare our novel structure-aware retrieval method to vector-based and LLM-scoring baselines, and evaluate various LLMs for response synthesis. This modular approach reveals specific failure modes and performance patterns, supporting the development of interpretable and effective educational QA systems. Our findings demonstrate the value of modular function calling in improving system transparency and pedagogical alignment. Website and Supplementary Material: https://chancharikmitra.github.io/EduMod-LLM-website/
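To make the idea of isolating components concrete, here is a minimal sketch of a three-stage pipeline in which function calling, retrieval, and generation are separate, swappable callables, and every intermediate output is recorded for inspection. All names (`PipelineTrace`, `run_pipeline`, the toy stages, and the tool names) are hypothetical illustrations, not the interfaces used in the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stage signatures; the paper's actual interfaces are not given here.
FunctionCaller = Callable[[str], List[str]]        # question -> tool names to invoke
Retriever = Callable[[str, List[str]], List[str]]  # question, tool names -> passages
Generator = Callable[[str, List[str]], str]        # question, passages -> answer


@dataclass
class PipelineTrace:
    """Keeps each stage's output so stages can be evaluated separately."""
    question: str
    calls: List[str] = field(default_factory=list)
    passages: List[str] = field(default_factory=list)
    answer: str = ""


def run_pipeline(question: str, call_fn: FunctionCaller,
                 retrieve_fn: Retriever, generate_fn: Generator) -> PipelineTrace:
    trace = PipelineTrace(question=question)
    trace.calls = call_fn(question)                       # stage 1: function calling
    trace.passages = retrieve_fn(question, trace.calls)   # stage 2: retrieval
    trace.answer = generate_fn(question, trace.passages)  # stage 3: generation
    return trace


# Toy stand-ins for the three stages (assumptions for illustration only).
def toy_caller(question: str) -> List[str]:
    return ["search_syllabus"] if "deadline" in question.lower() else ["search_notes"]


def toy_retriever(question: str, calls: List[str]) -> List[str]:
    corpus = {
        "search_syllabus": ["Homework 1 is due Friday."],
        "search_notes": ["Lecture 2 covers recursion."],
    }
    return [passage for call in calls for passage in corpus.get(call, [])]


def toy_generator(question: str, passages: List[str]) -> str:
    return " ".join(passages) if passages else "No relevant material found."


trace = run_pipeline("When is the homework deadline?",
                     toy_caller, toy_retriever, toy_generator)
print(trace.calls, trace.passages, trace.answer)
```

Because each stage is passed in as an argument, one stage can be replaced (for example, a different retriever) while the others stay fixed, which is the kind of per-component comparison the abstract describes.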

Paper Structure

This paper contains 30 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: EduMod-LLM is a modular approach to developing an LLM-based pipeline for answering student questions. We explore design considerations in function calling, retrieval, and LLM response generation via a high-quality and scalable evaluation module that leverages expert TA insights. This approach enables flexibility across courses and technical constraints, as well as transparency for iterative improvement.
  • Figure 2: Exact Match vs. MAE for LLM-as-a-Judge model across factuality, relevance, and style. DeepSeek generally achieves the best alignment with TA responses when measured across exact match and MAE.
  • Figure 3: Function-calling F1 Scores by Model and Pipeline. GPT-4o and GPT-4.1 achieve the highest scores among the models, while fc_multihop is the best-performing pipeline. The rule-based Edison does not rely on LLM function calling and thus achieves the same score for each model.
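
The metrics named in the captions for Figures 2 and 3 can be sketched as follows. This is an illustrative computation only: the integer rating scale, the example scores, the tool names, and the set-based definition of function-calling F1 are assumptions, since the exact definitions used in the paper are not stated in this text.

```python
from typing import Sequence, Set


def exact_match(judge: Sequence[int], ta: Sequence[int]) -> float:
    """Fraction of items where the judge's rating equals the TA's rating."""
    if len(judge) != len(ta) or not judge:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(j == t for j, t in zip(judge, ta)) / len(judge)


def mean_absolute_error(judge: Sequence[int], ta: Sequence[int]) -> float:
    """Average absolute gap between judge and TA ratings."""
    if len(judge) != len(ta) or not judge:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(j - t) for j, t in zip(judge, ta)) / len(judge)


def set_f1(predicted: Set[str], expected: Set[str]) -> float:
    """F1 over sets of function names (an assumed definition, for illustration)."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Made-up ratings on a hypothetical 1-3 scale.
judge_scores = [3, 2, 1, 3]
ta_scores = [3, 1, 1, 2]

em = exact_match(judge_scores, ta_scores)           # 2 of 4 ratings agree
mae = mean_absolute_error(judge_scores, ta_scores)  # (0 + 1 + 0 + 1) / 4
f1 = set_f1({"search_notes", "search_syllabus"}, {"search_notes", "search_forum"})
print(em, mae, f1)
```

Higher exact match and lower MAE both indicate closer agreement with TA ratings, which is why Figure 2 plots the two against each other.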