PanguIR Technical Report for NTCIR-18 AEOLLM Task

Lang Mei, Chong Chen, Jiaxin Mao

TL;DR

The paper addresses reference-free evaluation of large language models in the NTCIR-18 AEOLLM task, a multi-task benchmark covering dialogue, text expansion, summarization, and non-factoid QA. It proposes three strategies to improve alignment between model-driven scores and human judgments: multi-model collaboration, prompt auto-optimization, and ICL optimization with a retrieval-based in-context example framework. The resulting system (PanguIR) ensembles multiple LLMs, refines prompts iteratively, and selects in-context examples by combining a feedback-trained example retriever with a semantic relevance retriever. The authors report superior performance on AEOLLM, particularly in non-factoid QA. These scalable, bias-mitigated, reference-free evaluation methods can inform LLM benchmarking and deployment decisions in practice.

Abstract

As large language models (LLMs) gain widespread attention in both academia and industry, it becomes increasingly critical and challenging to effectively evaluate their capabilities. Existing evaluation methods can be broadly categorized into two types: manual evaluation and automatic evaluation. Manual evaluation, while comprehensive, is often costly and resource-intensive. Conversely, automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria (dominated by reference-based answers). To address these challenges, NTCIR-18 introduced the AEOLLM (Automatic Evaluation of LLMs) task, aiming to encourage reference-free evaluation methods that can overcome the limitations of existing approaches. In this paper, to enhance the evaluation performance of the AEOLLM task, we propose three key methods to improve the reference-free evaluation: 1) Multi-model Collaboration: Leveraging multiple LLMs to approximate human ratings across various subtasks; 2) Prompt Auto-optimization: Utilizing LLMs to iteratively refine the initial task prompts based on evaluation feedback from training samples; and 3) In-context Learning (ICL) Optimization: Based on the multi-task evaluation feedback, we train a specialized in-context example retrieval model, combined with a semantic relevance retrieval model, to jointly identify the most effective in-context learning examples. Experiments conducted on the final dataset demonstrate that our approach achieves superior performance on the AEOLLM task.
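As a rough illustration of two of the ideas above, the following minimal Python sketch shows (1) one possible way to combine ratings from several evaluator LLMs and (2) one possible way to merge the scores of two retrievers when picking in-context examples. The abstract does not specify how scores are combined, so the plain averaging, the weighted sum, the `alpha` weight, and all function and model names here are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean


def aggregate_scores(model_scores: dict[str, list[float]]) -> list[float]:
    """Average per-item ratings across evaluator models.

    Assumption: a simple unweighted mean; the paper may combine models differently.
    """
    per_model = list(model_scores.values())
    n_items = len(per_model[0])
    if any(len(scores) != n_items for scores in per_model):
        raise ValueError("every model must score the same items")
    # zip(*per_model) groups the ratings that different models gave to the same item.
    return [mean(item_ratings) for item_ratings in zip(*per_model)]


def select_examples(
    feedback_scores: list[float],
    semantic_scores: list[float],
    k: int = 2,
    alpha: float = 0.5,
) -> list[int]:
    """Return indices of the top-k candidate examples.

    Assumption: candidates are ranked by a weighted sum of a feedback-trained
    retriever score and a semantic relevance score (alpha is hypothetical).
    """
    combined = [
        alpha * f + (1 - alpha) * s
        for f, s in zip(feedback_scores, semantic_scores)
    ]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:k]


# Hypothetical ratings from two evaluator models for two answers.
ensemble = aggregate_scores({"model_a": [1.0, 3.0], "model_b": [3.0, 5.0]})

# Hypothetical retriever scores for three candidate in-context examples.
chosen = select_examples([0.9, 0.1, 0.5], [0.2, 0.8, 0.9], k=2)
```

In practice the averaged ratings would be compared against human judgments on the training samples, and the selected examples would be placed into the evaluation prompt.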

Paper Structure

This paper contains 15 sections, 6 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: In-context Learning (ICL) Optimization for reference-free evaluation.
  • Figure 2: Multi-model Collaboration for reference-free evaluation.
  • Figure 3: The optimized prompt for the summary generation task.