JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy, Morteza Ziyadi, Xuan Wang

TL;DR

JudgeBoard introduces a direct-evaluation paradigm in which small language models (SLMs) act as judges that assess the correctness of reasoning outputs, in contrast to traditional comparison-based approaches. It builds task-specific leaderboards over math and science domains using both accuracy-based and Elo-style metrics, enabling fine-grained comparison of judges. The paper further proposes Multi-Agent Judging (MAJ), a collaborative, profile-guided framework in which multiple SLMs deliberate to approximate LLM-level judgment accuracy. Empirical results show a substantial performance gap between isolated SLMs and large language models (LLMs), but MAJ improves SLM reliability and, on the MATH dataset, performs comparably to or even better than larger-sized counterparts, suggesting an efficient path to scalable model assessment.

Abstract

While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparably to, or even better than, their larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
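
To make the Elo-based leaderboard idea concrete, the following is a minimal sketch of how judges could be rated from their verdicts. It uses the standard Elo update; the pairing scheme (two judges are compared on the same question, and the one whose verdict matches the ground-truth label wins), the K-factor of 32, the initial rating of 1000, and the helper names `expected_score`, `update_elo`, and `rate_judges` are all illustrative assumptions, not details taken from the paper.

```python
import itertools


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rate_judges(verdicts, labels, initial: float = 1000.0, k: float = 32.0):
    """Rate judges from per-question verdicts (hypothetical pairing scheme).

    verdicts: dict mapping judge name -> list of bool ("answer is correct")
    labels:   list of bool ground-truth correctness, same length
    """
    ratings = {name: initial for name in verdicts}
    for i, label in enumerate(labels):
        # Assumption: every pair of judges "plays" once per question.
        for a, b in itertools.combinations(sorted(verdicts), 2):
            a_ok = verdicts[a][i] == label
            b_ok = verdicts[b][i] == label
            if a_ok == b_ok:
                score_a = 0.5  # both right or both wrong: draw
            else:
                score_a = 1.0 if a_ok else 0.0
            ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a, k)
    return ratings


if __name__ == "__main__":
    demo = rate_judges(
        {"judge_small": [True, True, False], "judge_large": [True, False, False]},
        labels=[True, False, False],
    )
    for name, rating in sorted(demo.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Unlike plain accuracy, a pairwise rating of this kind rewards a judge more for being right on questions where other judges are wrong, which is one plausible reason to report both metrics side by side.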

Paper Structure

This paper contains 46 sections, 4 figures, 20 tables.

Figures (4)

  • Figure 1: Comparison of JudgeBoard and MAJ with previous works. Unlike previous works that usually follow a comparison-based evaluation pipeline, JudgeBoard and MAJ focus on direct evaluation of the factual correctness of the reasoning questions.
  • Figure 2: Overview of the JudgeBoard Pipeline
  • Figure 3: Overall accuracy when using the Deductive Reasoner (DR), Logical Reasoner (LR), and Robust Reasoner (RR) profiles on (a) Algebra, (b) Number Theory, and (c) Counting and Probability tasks.
  • Figure 4: Performance comparison of large language models on mathematical reasoning benchmarks. Each point represents a model, with circle size proportional to parameter count (2B-120B parameters).