MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors
Baraa Hikal, Mohamed Basem, Islam Oshallah, Ali Hamdi
TL;DR
This work tackles the challenge of evaluating AI tutors across four pedagogical dimensions in math dialogues using MRBench-derived data. It introduces MSA-MathEval, a unified pipeline that fine-tunes a single instruction-tuned model, Mathstral-7B-v0.1, with LoRA adapters and applies a disagreement-aware ensemble to handle all tracks without task-specific architectural changes, optimizing predictions for macro-F1. A key contribution is the disagreement-driven inference strategy, which preserves minority labels (e.g., 'To some extent') by aligning predictions with the development-set label distribution, improving per-class recall under macro-F1 evaluation. The approach achieves strong results, including 1st place in Providing Guidance and top-5 finishes in all four tracks, demonstrating scalable, robust multi-dimensional evaluation of LLMs as math tutors and highlighting avenues for cross-domain generalization and calibration.
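To make concrete why minority-label recall matters under macro-F1, the following is a minimal sketch of the metric in plain Python. The three-way label set (`Yes`, `To some extent`, `No`) and the toy gold/prediction lists are illustrative assumptions, not data from the shared task.

```python
# Illustrative macro-F1: the unweighted mean of per-class F1 scores.
# LABELS and the toy data below are assumptions for demonstration only.
LABELS = ["Yes", "To some extent", "No"]


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted average of per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy gold labels: 6 "Yes", 2 "To some extent", 2 "No".
gold = ["Yes"] * 6 + ["To some extent"] * 2 + ["No"] * 2

# System A never predicts the minority label.
pred_a = ["Yes"] * 6 + ["Yes", "Yes"] + ["No"] * 2

# System B recovers one of the two minority examples.
pred_b = ["Yes"] * 6 + ["To some extent", "Yes"] + ["No"] * 2

score_a = macro_f1(gold, pred_a)
score_b = macro_f1(gold, pred_b)
print(f"collapsed minority: {score_a:.3f}, partial recovery: {score_b:.3f}")
```

In this toy setup the two systems differ on a single example, yet the macro-F1 gap is large because a class with zero recall contributes an F1 of zero to the average.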
Abstract
We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
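One way a disagreement-aware ensemble could favor minority labels is sketched below. This is a hypothetical illustration, not the authors' exact procedure: the function name `disagreement_aware_vote`, the deficit-based tie-breaking rule, and the target distribution are all assumptions. Unanimous items keep their label; contested items are resolved toward the candidate label that is most under-represented relative to a target distribution (for example, one estimated from a development set).

```python
from collections import Counter


def disagreement_aware_vote(votes_per_item, target_dist):
    """Resolve ensemble votes, steering contested items toward
    under-represented labels (hypothetical rule, for illustration).

    votes_per_item: list of per-item vote lists, one label per model.
    target_dist: dict mapping label -> expected fraction of items.
    """
    n = len(votes_per_item)
    final = [None] * n
    counts = Counter()
    contested = []

    # Pass 1: accept every item on which all models agree.
    for i, votes in enumerate(votes_per_item):
        if len(set(votes)) == 1:
            final[i] = votes[0]
            counts[votes[0]] += 1
        else:
            contested.append(i)

    # Pass 2: for contested items, pick among the labels actually voted
    # for, preferring the one furthest below its expected count and
    # breaking ties by the number of votes it received.
    for i in contested:
        vote_counts = Counter(votes_per_item[i])
        candidates = sorted(vote_counts)
        best = max(
            candidates,
            key=lambda lab: (target_dist.get(lab, 0.0) * n - counts[lab],
                             vote_counts[lab]),
        )
        final[i] = best
        counts[best] += 1
    return final


# Toy example with three ensemble members and four items.
votes = [
    ["Yes", "Yes", "Yes"],
    ["Yes", "Yes", "To some extent"],  # contested item
    ["No", "No", "No"],
    ["Yes", "Yes", "Yes"],
]
target = {"Yes": 0.5, "To some extent": 0.25, "No": 0.25}
resolved = disagreement_aware_vote(votes, target)
print(resolved)
```

A plain majority vote would label the contested item `Yes`; under this sketch it becomes `To some extent`, because that label is still below its expected share when the item is resolved.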
