Ensemble ToT of LLMs and Its Application to Automatic Grading System for Supporting Self-Learning

Yuki Ito, Qiang Ma

TL;DR

This paper addresses the challenge of providing detailed, timely grading feedback by moving beyond single-LLM grading to Ensemble ToT, a framework that coordinates multiple language models. It introduces GET, a grading system that uses pseudo-learning to identify each LLM's grading tendencies, generates multiple candidate grading results via Tree-of-Thought, and integrates them through a simulated debate to produce accurate grades with explainable reasons. On the SAF dataset, GET achieves higher grading-label accuracy and macro F1 than the baselines on the unseen-question and unseen-answer subsets, and its feedback also scores higher under automated quality evaluation. The work highlights the practical potential of multi-LLM collaboration for scalable self-learning support, while noting limitations such as dependence on a fixed set of models and the need for user-perception studies.

Abstract

Providing students with detailed and timely grading feedback is essential for self-learning. Existing LLM-based grading systems are promising, but most rely on a single model, which limits their performance. To address this, we propose Ensemble Tree-of-Thought (ToT), a framework that enhances LLM outputs by integrating multiple models, and use it to develop a grading system. Ensemble ToT follows three steps: (1) analyzing LLM performance, (2) generating candidate answers, and (3) refining them into a final result. Accordingly, our grading system first evaluates the grading tendencies of the LLMs, then generates multiple grading results, and finally integrates them via a simulated debate. Experimental results demonstrate that our approach provides accurate and explainable grading by effectively coordinating multiple LLMs.
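
The three steps named in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the three toy graders stand in for real LLMs, the label names are assumed, and step (3) uses an accuracy-weighted vote as a simple substitute for the simulated debate described in the paper. All function names here are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A "grader" stands in for one LLM: (question, answer) -> label.
Grader = Callable[[str, str], str]
Example = Tuple[str, str, str]  # (question, answer, gold label)


def analyze_tendencies(graders: Dict[str, Grader],
                       labeled: List[Example]) -> Dict[str, float]:
    """Step 1: estimate each grader's accuracy on a small labeled set."""
    scores = {}
    for name, grade in graders.items():
        hits = sum(grade(q, a) == gold for q, a, gold in labeled)
        scores[name] = hits / len(labeled)
    return scores


def generate_candidates(graders: Dict[str, Grader],
                        question: str, answer: str) -> Dict[str, str]:
    """Step 2: collect one candidate label per grader."""
    return {name: grade(question, answer) for name, grade in graders.items()}


def integrate(candidates: Dict[str, str],
              tendencies: Dict[str, float]) -> str:
    """Step 3 (stand-in): accuracy-weighted vote, NOT a simulated debate."""
    weights = Counter()
    for name, label in candidates.items():
        weights[label] += tendencies[name]
    return max(weights, key=weights.get)


# Toy graders (assumptions for illustration only).
GRADERS: Dict[str, Grader] = {
    "strict": lambda q, a: "correct" if a.lower() == "paris" else "incorrect",
    "lenient": lambda q, a: "correct",
    "keyword": lambda q, a: "correct" if "paris" in a.lower() else "incorrect",
}

LABELED: List[Example] = [
    ("Capital of France?", "Paris", "correct"),
    ("Capital of France?", "Lyon", "incorrect"),
    ("Capital of France?", "It is Paris.", "correct"),
    ("Capital of France?", "Rome", "incorrect"),
]


def run_pipeline(question: str, answer: str) -> str:
    """Run the three steps end to end on one student answer."""
    tendencies = analyze_tendencies(GRADERS, LABELED)
    candidates = generate_candidates(GRADERS, question, answer)
    return integrate(candidates, tendencies)
```

In this sketch a grader that performed poorly in step (1) still contributes a candidate in step (2), but its opinion carries less weight in step (3); a debate-style integration would instead let the models exchange and revise their reasoning.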

Paper Structure

This paper contains 47 sections, 5 equations, 25 figures, 11 tables, and 1 algorithm.

Figures (25)

  • Figure 1: Overview of Ensemble ToT Framework: The framework integrates ensemble learning techniques with the Tree-of-Thought (ToT) approach. It identifies individual LLM performance tendencies and synthesizes multiple candidate solutions generated by LLMs into a single refined result.
  • Figure 2: Process Diagram of GET: The system consists of three stages: pseudo-learning, multi-LLM grading, and debate integration. This enables accurate grading by taking advantage of the characteristics of multiple models.
  • Figure 3: The prompt used for analyzing labeling tendencies: This figure illustrates the prompt provided to the LLM during the labeling tendencies analysis. It includes three-class classification performance metrics in JSON format, along with detailed instructions and the expected output.
  • Figure 4: Process Flow of Multi-LLM Grading
  • Figure 5: Process Flow of Debate Integration
  • ...and 20 more figures
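
The caption of Figure 3 mentions three-class classification performance metrics in JSON format. As a rough illustration of what such a summary might contain, the sketch below computes per-class precision, recall, and F1 plus macro F1 and serializes them with the standard `json` module. The label names and the JSON layout are assumptions, not the schema used in the paper.

```python
import json
from typing import Dict, List

# Assumed label set for a three-class grading task.
LABELS = ["correct", "partially_correct", "incorrect"]


def per_class_metrics(gold: List[str], pred: List[str],
                      labels: List[str] = LABELS) -> Dict:
    """Compute precision, recall, and F1 per class, plus macro F1."""
    report: Dict = {}
    pairs = list(zip(gold, pred))
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro F1 is the unweighted mean of the per-class F1 scores.
    report["macro_f1"] = sum(report[l]["f1"] for l in labels) / len(labels)
    return report


def metrics_as_json(gold: List[str], pred: List[str]) -> str:
    """Serialize the metrics so they can be pasted into a prompt."""
    return json.dumps(per_class_metrics(gold, pred), indent=2)


# Made-up labels purely for demonstration.
GOLD = ["correct", "correct", "partially_correct",
        "partially_correct", "incorrect", "incorrect"]
PRED = ["correct", "correct", "correct",
        "partially_correct", "incorrect", "partially_correct"]
```

Calling `metrics_as_json(GOLD, PRED)` yields a JSON string with one entry per class and a top-level `macro_f1` field. A per-class breakdown of this kind makes a model's tendencies visible, for example a grader that over-predicts one label shows low precision on that class.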