
When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring

Tahreem Yasir, Sutapa Dey Tithi, Benyamin Tabarsi, Dmitri Droujkov, Sam Gilson, Yasitha Rajapaksha, Xiaoyi Tian, Arun Ramesh, DongKuan Xu, Tiffany Barnes

Abstract

Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner's current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error-prone (<70% accuracy), but degrades performance by 4-6 percentage points through over-specification when feedback is already reliable (>85%). Critically, we identify a shared complexity ceiling; no model or pipeline reliably succeeds on proof states exceeding complexity 4-5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty-aware architectures that route problems by estimated complexity and upstream reliability.
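One way to read the abstract's closing recommendation is as a routing policy over the thresholds it reports. The sketch below is purely illustrative: the function name, the routing labels, and the middle-band fallback are assumptions, not the paper's implementation; only the numeric thresholds (accuracy below 70% or above 85%, complexity ceiling of 4-5) come from the abstract.

```python
# Hypothetical difficulty-aware router built from the thresholds quoted in
# the abstract. The policy and names are illustrative, not the paper's code.

def route(estimated_complexity: float, upstream_accuracy: float) -> str:
    """Pick a feedback pipeline for one proof state."""
    if estimated_complexity > 5:
        return "escalate"       # beyond the shared complexity ceiling
    if upstream_accuracy < 0.70:
        return "tutor+judge"    # verification helps when feedback is error-prone
    if upstream_accuracy > 0.85:
        return "tutor"          # skip the verifier: it over-specifies here
    return "teacher"            # middle band: give full derivation access

print(route(6, 0.90))   # escalate
print(route(3, 0.60))   # tutor+judge
print(route(3, 0.90))   # tutor
```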

Paper Structure

This paper contains 59 sections, 3 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 2: Illustrative proof instance showing intermediate derivations and differential hint access for Tutor and Teacher agents.
  • Figure 3: Initial proof state and goal specification. The student is presented with the premises (top left) and the target conclusion ($J \lor K$) at the bottom. Available inference rules are displayed in the middle. At this stage, no intermediate steps have been derived, and the student must choose a productive forward step toward the goal.
  • Figure 4: Rule application with guided simplification. After deriving an intermediate conjunction via Modus Ponens, the student applies the Simplification rule. The interface prompts the learner to select the appropriate resulting literal ($G$ or $\lnot H$), illustrating fine-grained, step-level decision making supported by rule constraints.
  • Figure 5: Successful proof completion. The student derives $J$ via Disjunctive Syllogism and applies the Addition rule to reach the target conclusion $J \lor K$. The system confirms completion, reinforcing correct rule sequencing and alignment with the goal state.
  • Figure 6: Base System Prompt for Student
  • ...and 10 more figures
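The proof traced in Figures 3-5 follows a fixed rule sequence: Modus Ponens, Simplification, Disjunctive Syllogism, Addition. The sketch below walks that sequence through a tiny rule checker; the premises are hypothetical stand-ins chosen only to make the captioned sequence go through (the paper's actual premises appear in Figure 3 and are not reproduced here).

```python
# Minimal propositional-rule checker. Formulas are nested tuples,
# e.g. ("->", "F", ("and", "G", ("not", "H"))). Premises are
# hypothetical stand-ins, not the paper's actual problem.

def modus_ponens(p, impl):
    op, antecedent, consequent = impl
    assert op == "->" and antecedent == p
    return consequent

def simplification(conj, side):
    op, left, right = conj
    assert op == "and"
    return left if side == 0 else right

def disjunctive_syllogism(disj, neg):
    op, left, right = disj
    assert op == "or"
    if neg == ("not", left):   # neg rules out the left disjunct
        return right
    assert neg == ("not", right)
    return left

def addition(p, q):
    return ("or", p, q)

# Hypothetical proof state mirroring the rule sequence in Figures 3-5:
premises = ["F", ("->", "F", ("and", "G", ("not", "H"))), ("or", "H", "J")]

step1 = modus_ponens(premises[0], premises[1])       # G and not-H
step2 = simplification(step1, 1)                     # not-H
step3 = disjunctive_syllogism(premises[2], step2)    # J
goal = addition(step3, "K")                          # J or K
assert goal == ("or", "J", "K")
```

Each rule function raises if misapplied, which mirrors the step-level validation the tutoring interface performs on each student move.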