SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling

Fares Fawzi, Vinitra Swamy, Dominik Glandorf, Tanya Nazaretsky, Tanja Käser

TL;DR

The paper addresses the need for privacy-preserving, locally runnable educational assistants capable of grounded, pedagogically valid feedback. It introduces SCRIBE, a framework that combines multi-hop, tool-augmented reasoning with a self-reflection loop and a two-stage LoRA finetuning process to distill GPT-4o behaviors into 8B open-source models. A synthetic data pipeline and six domain-specific tools support grounded explanations, while a GPT-based judge and a real-user study demonstrate that 8B-SCRIBE matches or exceeds much larger models on key dimensions like relevance and actionability. The results indicate SCRIBE’s viability for low-resource, privacy-sensitive educational deployments and offer a scalable path toward trustworthy, interactive feedback systems.

Abstract

Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
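The closed-loop inference described above (iterative reasoning, tool use, and error recovery) can be illustrated with a minimal sketch. This is not the authors' implementation: the tool name `get_weekly_activity`, its data, the `run_loop` function, and the `scripted_policy` stand-in for the fine-tuned model are all hypothetical and chosen only to show how an error message can be fed back to the model so that it revises its next tool call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool; the name and the data it returns are illustrative only.
def get_weekly_activity(student_id: str) -> str:
    data = {"s1": "3 sessions in week 2"}
    if student_id not in data:
        raise KeyError(f"unknown student: {student_id}")
    return data[student_id]

TOOLS: Dict[str, Callable[[str], str]] = {"get_weekly_activity": get_weekly_activity}

# A step is (kind, tool_name, argument); kind is "tool" or "answer".
Step = Tuple[str, str, str]

def run_loop(policy: Callable[[List[str]], Step], max_hops: int = 5) -> str:
    """Closed loop: the policy reads the history and chooses the next step."""
    history: List[str] = []
    for _ in range(max_hops):
        kind, name, arg = policy(history)
        if kind == "answer":
            return arg
        try:
            if name not in TOOLS:
                raise ValueError(f"unknown tool: {name}")
            history.append(f"OBSERVATION: {TOOLS[name](arg)}")
        except Exception as exc:
            # Self-reflection step: return the error to the policy so it can retry.
            history.append(f"ERROR: {exc}. Reflect and retry.")
    return "No answer within the hop budget."

def scripted_policy(history: List[str]) -> Step:
    """Stand-in for a model: one faulty call, then a corrected call, then an answer."""
    if not history:
        return ("tool", "get_activity", "s1")  # deliberately wrong tool name
    if history[-1].startswith("ERROR"):
        return ("tool", "get_weekly_activity", "s1")
    return ("answer", "", f"Based on your log ({history[-1]}), plan one more session.")

result = run_loop(scripted_policy)
print(result)
```

In this sketch the hop budget (`max_hops`) bounds the number of reasoning and tool-call steps, so a model that never recovers from an error still terminates with a fallback message.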

Paper Structure

This paper contains 39 sections, 3 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Structured multi-hop reasoning for pedagogically valid feedback via tool calls. The question is addressed using distinct reasoning strategies: one model uses multi-step analysis of learner behavior to give personalized advice (left), while the other links the question to dimensions of effective learning behavior to give general guidance (right).
  • Figure 2: SCRIBE Data Generation Pipeline. Synthetic data is generated by collecting questions from students to guide expert annotators in identifying essential tools (Stage 1). GPT-4o generates reasoning chains with these tools, and GPT-4.1 filters the outputs based on actionability, relevance, tool use, and correctness (Stage 2).
  • Figure 3: SCRIBE finetuning, inference, and evaluation pipelines. Finetuning involves two successive LoRA stages for multi-hop reasoning with tool use. Inference operates as a closed-loop system with self-reflection prompting for error correction. Evaluation combines GPT-as-a-judge assessments and a user study.
  • Figure 4: Percentage of YES judgments given by GPT-Judge for each criterion on a holdout dataset of the GEO, DSP, and VA MOOCs (top) and a holdout set of the LNV MOOC (bottom). Hashed bars indicate SCRIBE models.
  • Figure 5: Average ratings from 108 students (1–5 scale) for Llama-3.3 70B, GPT-4o, and ToolACE-8B SCRIBE.
  • ...and 13 more figures