Self-Evaluation of Large Language Model based on Glass-box Features

Hui Huang; Yingqi Qu; Jing Liu; Muyun Yang; Bing Xu; Tiejun Zhao; Wenpeng Lu

TL;DR

This study investigates various groups of glass-box features and finds that the softmax distribution serves as a reliable quality indicator when an LLM evaluates its own output.

Abstract

The proliferation of open-source Large Language Models (LLMs) underscores the pressing need for evaluation methods. Existing works rely primarily on external evaluators and focus on training and prompting strategies. However, a crucial aspect, model-aware glass-box features, has been overlooked. In this study, we explore the utility of glass-box features in the self-evaluation scenario, namely applying an LLM to evaluate its own output. We investigate various groups of glass-box features and find that the softmax distribution serves as a reliable quality indicator for self-evaluation. Experimental results on public benchmarks validate the feasibility of self-evaluation of LLMs using glass-box features.
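The abstract does not spell out which statistics are taken from the softmax distribution, so the following is only a minimal sketch under assumptions: it computes two commonly used signals, the mean log-probability of the generated tokens and the mean entropy of the per-step distribution. The function names `softmax` and `softmax_features` and the input layout (one list of logits per generated token) are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one decoding step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_features(step_logits, chosen_ids):
    """Summarize per-step softmax distributions of a generated response.

    step_logits: one list of vocabulary logits per generated token (assumed layout).
    chosen_ids:  index of the token actually generated at each step.
    Returns the mean log-probability of the chosen tokens and the mean entropy.
    """
    if not step_logits or len(step_logits) != len(chosen_ids):
        raise ValueError("need one chosen token id per non-empty logit vector")

    log_probs, entropies = [], []
    for logits, tok in zip(step_logits, chosen_ids):
        probs = softmax(logits)
        log_probs.append(math.log(probs[tok]))
        # Entropy of the full distribution; zero-probability terms contribute nothing.
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))

    n = len(log_probs)
    return {
        "mean_log_prob": sum(log_probs) / n,
        "mean_entropy": sum(entropies) / n,
    }
```

Under this sketch, a response generated from peaked distributions gets a higher mean log-probability and a lower mean entropy than one generated from flat distributions, which is the intuition behind treating the softmax output as a quality signal.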

Paper Structure (17 sections, 9 equations, 8 figures, 2 tables)

Introduction
Glass-box Features for Self-Evaluation
Softmax Distribution
Uncertainty Estimation
Attention Distribution
Self-Evaluation with Reference
In-Context Illustration
Probability Calibration
Experiments
Set-up
Main Results
Self-Evaluation with Reference
Conclusion
Appendix
Prompt Pool for Prompt-based Ensemble
...and 2 more sections

Figures (8)

Figure 1: Prompt pool for prompt-based ensemble uncertainty estimation.
Figure 2: Prompt format with in-context illustration. The shaded part is the illustration with reference.
Figure 3: Prompt template for GPT4 and GPT-3.5-Turbo applied for single-turn evaluation.
Figure 4: Prompt template for GPT4 and GPT-3.5-Turbo applied for multi-turn evaluation.
Figure 5: Prompt template for Auto-J applied for single-turn evaluation.
...and 3 more figures
