MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

Baraa Hikal, Mohamed Basem, Islam Oshallah, Ali Hamdi

TL;DR

This work tackles the challenge of evaluating AI tutors across four pedagogical dimensions in math dialogues using MRBench-derived data. It introduces MSA-MathEval, a unified pipeline that fine-tunes a single instruction-tuned model, Mathstral-7B-v0.1, with LoRA adapters and applies a disagreement-aware ensemble at inference, handling all four tracks without task-specific architectural changes. Predictions are optimized for macro-F1, the evaluation metric. A key contribution is the disagreement-driven inference strategy, which preserves minority labels (e.g., 'To some extent') by aligning predictions with the development-set label distribution, improving per-class recall under macro-F1 evaluation. The approach ranks 1st in Providing Guidance and in the top 5 on every track, demonstrating scalable, robust multi-dimensional evaluation of LLMs as math tutors and highlighting avenues for cross-domain generalization and calibration.
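To make the role of macro-F1 concrete, the following is a minimal sketch of the metric on toy data. The label set matches the three labels named in the figure captions, but the gold and predicted lists are invented for illustration and are not taken from the shared-task data. It shows why a system that never predicts a minority label is penalized more heavily under macro-F1 than under accuracy.

```python
LABELS = ["Yes", "To some extent", "No"]

def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1, so every label counts equally."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy data (hypothetical): 7 "Yes", 2 "To some extent", 1 "No"
gold = ["Yes"] * 7 + ["To some extent"] * 2 + ["No"]
# Predictions that collapse the minority label into "Yes"
collapsed = ["Yes"] * 9 + ["No"]
# Predictions that recover one of the two minority items
recovered = ["Yes"] * 7 + ["To some extent", "Yes"] + ["No"]

print(accuracy(gold, collapsed), macro_f1(gold, collapsed))
print(accuracy(gold, recovered), macro_f1(gold, recovered))
```

In this toy case, recovering a single minority item moves accuracy by one tenth, while macro-F1 rises by considerably more, because the per-class F1 for 'To some extent' goes from zero to a positive value.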

Abstract

We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
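The abstract does not spell out the ensemble rule, so the sketch below is only one plausible reading of "disagreement-aware" inference, not the authors' implementation. It assumes several model runs vote on each item; unanimous items keep their label, and contested items are resolved toward whichever voted label is currently most under-represented relative to a target label distribution. The function name `disagreement_aware_vote`, the `dev_share` dictionary, and the toy votes are all hypothetical.

```python
from collections import Counter

def disagreement_aware_vote(votes_per_item, dev_share):
    """Resolve per-item votes from several models (illustrative assumption).

    votes_per_item: list of lists, one inner list of labels per item.
    dev_share: dict mapping label -> assumed fraction in the dev set.
    """
    n = len(votes_per_item)
    final = [None] * n

    # Pass 1: items on which every model agrees keep that label.
    for i, votes in enumerate(votes_per_item):
        if len(set(votes)) == 1:
            final[i] = votes[0]
    counts = Counter(label for label in final if label is not None)

    # Pass 2: for contested items, pick the voted label with the largest
    # deficit (target count minus current count). Only labels that at
    # least one model proposed are candidates.
    for i, votes in enumerate(votes_per_item):
        if final[i] is not None:
            continue
        candidates = sorted(set(votes))
        best = max(candidates, key=lambda l: dev_share.get(l, 0.0) * n - counts[l])
        final[i] = best
        counts[best] += 1
    return final

# Toy votes from three hypothetical model runs on four items
votes = [
    ["Yes", "Yes", "Yes"],
    ["Yes", "Yes", "Yes"],
    ["Yes", "To some extent", "Yes"],   # contested item
    ["No", "No", "No"],
]
dev_share = {"Yes": 0.5, "To some extent": 0.25, "No": 0.25}
print(disagreement_aware_vote(votes, dev_share))
```

Under plain majority voting the contested item would be labeled "Yes"; under this rule it is labeled "To some extent", which is the kind of minority-label coverage the abstract describes.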

Paper Structure

This paper contains 20 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of our unified MSA-MathEval framework for the BEA 2025 Shared Task. The pipeline includes preprocessing, LoRA-based fine-tuning of Mathstral-7B-v0.1, and disagreement-aware ensemble inference.
  • Figure 2: LoRA adaptation adds trainable low-rank matrices $A$ and $B$ to frozen attention weights $W_0$, producing an effective weight $W = W_0 + \alpha AB$ during training. Only $A$ and $B$ are updated, enabling memory-efficient fine-tuning (Hu et al., 2021).
  • Figure 3: Label distribution comparison across tracks and systems. Each group shows the percentage of predictions per label ("Yes", "No", "To some extent") for the dev set, single model, and ensemble.
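The LoRA update in the Figure 2 caption, $W = W_0 + \alpha AB$, can be illustrated with a small numeric sketch. It follows the caption's notation directly; the matrix sizes, values, and the helper names `matmul` and `lora_effective_weight` are illustrative assumptions, and real implementations operate on much larger attention weights and may scale $\alpha$ differently.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (m x k) and Y (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W0, A, B, alpha):
    """Effective weight W = W0 + alpha * (A @ B), per the caption's formula."""
    AB = matmul(A, B)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, AB)]

# Frozen 3x3 weight (identity here, purely for readability)
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
# Trainable low-rank factors with rank r = 1: A is 3x1, B is 1x3
A = [[1.0], [2.0], [0.0]]
B = [[0.5, 0.0, -1.0]]
alpha = 2.0

W = lora_effective_weight(W0, A, B, alpha)
for row in W:
    print(row)

# Trainable parameters: d*r + r*d for LoRA versus d*d for full fine-tuning
d, r = len(W0), len(B)
print("LoRA params:", 2 * d * r, "full params:", d * d)
```

The saving is small at this toy size, but it grows with the width $d$ for a fixed rank $r$, since the adapter has $2dr$ parameters while the full matrix has $d^2$.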