Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task

Junjie Chen, Haitao Li, Zhumin Chu, Yiqun Liu, Qingyao Ai

TL;DR

The paper introduces AEOLLM, a public task for automatic, reference-free evaluation of LLMs across four open-ended subtasks: dialogue generation, text expansion, summary generation, and non-factoid QA. It describes dataset construction (2,800 instances from four datasets), human-annotated gold standards, and an automated evaluation pipeline using accuracy, Kendall's tau, and Spearman's rho to gauge agreement with human judgments. Results from 48 runs across four teams reveal that PanguIR leads in accuracy while UCLWI excels in rank-based metrics, with Text Expansion proving most challenging and GPT-4o baselines performing strongly on some metrics, highlighting the benefits of multi-metric evaluation. The work provides a practical dataset, an automated pipeline, and insights into method trade-offs, informing future development of reference-free evaluation methods for LLMs and broader application across tasks.
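The three agreement measures named above can be illustrated with a minimal pure-Python sketch. It assumes that, for a single question, both the human annotators and the evaluation method assign one numeric score to each candidate answer, and that accuracy is computed over answer pairs as the share of pairs where both sides express the same preference. That pairwise reading, the function names, and the toy scores are assumptions made for illustration, not the task's official scoring code.

```python
from itertools import combinations
from math import sqrt


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_accuracy(human, predicted):
    """Share of answer pairs whose preference (>, <, =) matches the human one."""
    pairs = list(combinations(range(len(human)), 2))
    agree = sum(
        sign(human[i] - human[j]) == sign(predicted[i] - predicted[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        sx, sy = sign(x[i] - x[j]), sign(y[i] - y[j])
        if sx == 0 and sy == 0:
            continue  # tied in both lists: contributes to neither factor
        if sx == 0:
            tied_x_only += 1
        elif sy == 0:
            tied_y_only += 1
        elif sx == sy:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else float("nan")


# Toy scores for five candidate answers to one question (illustrative only).
human_scores = [4, 2, 5, 1, 3]
method_scores = [3, 2, 5, 1, 4]

print(pairwise_accuracy(human_scores, method_scores))
print(kendall_tau_b(human_scores, method_scores))
print(spearman_rho(human_scores, method_scores))
```

In this toy case the method swaps the order of only one answer pair, so all three measures stay high but below 1. Reporting them together is useful because accuracy only counts matching pairwise preferences, while the two rank correlations summarize how well the overall ordering is preserved.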

Abstract

In this paper, we provide an overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. As large language models (LLMs) grow popular in both academia and industry, evaluating their capabilities effectively has become increasingly critical yet remains challenging. Existing methods fall into two types: manual evaluation, which is expensive, and automatic evaluation, which is limited in task format (most benchmarks use multiple-choice questions) and in evaluation criteria (dominated by reference-based metrics). To advance automatic evaluation, we propose the AEOLLM task, which focuses on generative tasks and encourages reference-free methods. We also set up diverse subtasks, namely dialogue generation, text expansion, summary generation, and non-factoid question answering, to test different methods comprehensively. This year, we received 48 runs from 4 teams in total. This paper describes the background of the task, the dataset, the evaluation measures, and the evaluation results.

Paper Structure

This paper contains 15 sections, 2 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The overall framework of the AEOLLM task.