Ensemble ToT of LLMs and Its Application to Automatic Grading System for Supporting Self-Learning
Yuki Ito, Qiang Ma
TL;DR
This paper addresses the challenge of providing detailed, timely grading feedback by replacing single-LLM grading with Ensemble Tree-of-Thought (ToT), a framework that coordinates multiple language models. It introduces GET, a grading system that uses pseudo-learning to identify each LLM's grading tendencies, generates multiple candidate gradings via Tree-of-Thought, and integrates them through a simulated debate to produce accurate, explainable grading reasons. On the SAF dataset, GET achieves higher grading-label accuracy and macro F1 than the baselines on the unseen-question and unseen-answer subsets, and its feedback also scores better under automated quality evaluation. The work highlights the practical potential of multi-LLM collaboration for scalable self-learning support, while noting limitations such as its dependence on a fixed set of models and the need for user-perception studies.
Abstract
Providing students with detailed and timely grading feedback is essential for self-learning. Existing LLM-based grading systems are promising, but most rely on a single model, which limits their performance. To address this, we propose Ensemble Tree-of-Thought (ToT), a framework that improves LLM outputs by integrating multiple models, and use it to build a grading system. Ensemble ToT follows three steps: (1) analyzing the performance of each LLM, (2) generating candidate answers, and (3) refining the candidates into a final result. Accordingly, our grading system first evaluates the grading tendencies of the LLMs, then generates multiple grading results, and finally integrates them via a simulated debate. Experimental results demonstrate that our approach provides accurate and explainable grading by effectively coordinating multiple LLMs.
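To make the three steps concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions: the three graders are hypothetical rule-based stand-ins for LLM calls, the toy questions and labels are invented for illustration, and the simulated debate in step 3 is replaced by a much simpler accuracy-weighted vote. All function names are illustrative.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A grader maps (question, answer) to (label, reason).
# Assumption: in a real system each grader would wrap a call to a different LLM.
Grader = Callable[[str, str], Tuple[str, str]]


def lenient(question: str, answer: str) -> Tuple[str, str]:
    ok = bool(answer.strip())
    return ("correct", "answer is non-empty") if ok else ("incorrect", "answer is empty")


def keyword(question: str, answer: str) -> Tuple[str, str]:
    ok = "tcp" in answer.lower()
    return ("correct", "mentions TCP") if ok else ("incorrect", "does not mention TCP")


def strict(question: str, answer: str) -> Tuple[str, str]:
    ok = "tcp" in answer.lower() and len(answer.split()) >= 3
    return ("correct", "mentions TCP in a full sentence") if ok else ("incorrect", "too short or off-topic")


def analyze_graders(graders: Dict[str, Grader],
                    labeled: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Step 1: estimate each grader's accuracy on a small labeled set."""
    return {
        name: sum(grade(q, a)[0] == gold for q, a, gold in labeled) / len(labeled)
        for name, grade in graders.items()
    }


def generate_candidates(graders: Dict[str, Grader],
                        question: str, answer: str) -> Dict[str, Tuple[str, str]]:
    """Step 2: collect one candidate (label, reason) per grader."""
    return {name: grade(question, answer) for name, grade in graders.items()}


def integrate(candidates: Dict[str, Tuple[str, str]],
              weights: Dict[str, float]) -> Tuple[str, List[str]]:
    """Step 3 (stand-in): accuracy-weighted vote instead of a simulated debate."""
    totals: Dict[str, float] = defaultdict(float)
    for name, (label, _) in candidates.items():
        totals[label] += weights[name]
    best = max(totals, key=totals.get)
    reasons = [reason for label, reason in candidates.values() if label == best]
    return best, reasons


graders = {"lenient": lenient, "keyword": keyword, "strict": strict}
question = "Which transport protocol guarantees delivery?"
labeled = [  # invented toy examples: (question, answer, gold label)
    (question, "TCP is reliable", "correct"),
    (question, "UDP", "incorrect"),
    (question, "tcp", "correct"),
]

weights = analyze_graders(graders, labeled)
candidates = generate_candidates(graders, question, "I think UDP guarantees delivery")
label, reasons = integrate(candidates, weights)
print(label, reasons)
```

In this toy run the lenient grader is outvoted by the two graders that penalize the missing keyword, and the reasons attached to the winning label are returned alongside it. A debate-based integration, as described in the abstract, would instead let the models critique each other's candidates before settling on a final label and reason.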
