E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models

Zhenyu Zhang, Bingguang Hao, Jinpeng Li, Zekai Zhang, Dongyan Zhao

TL;DR

The paper introduces E-Bench, a benchmark to quantify the ease-of-use of large language models by simulating human-style prompt perturbations—paraphrasing, simplification, colloquialism, and typing errors—on AlpacaEval and measuring performance drops against a GPT-4 reference. Through experiments on six representative LLMs, it finds that larger models generally exhibit improved robustness to synonymous perturbations, but typing perturbations show no consistent scaling behavior, and model-specific strengths emerge (GPTs for paraphrasing, Vicuna for colloquialism). An error analysis reveals factors like 'challenge', 'safety', and 'refusal' as major failure modes, particularly for Llama 2-chat 7b, underscoring gaps in user-friendliness. The work emphasizes the influence of training data on ease-of-use and provides an open benchmark to drive future improvements toward more robust and usable generative systems.

Abstract

Most large language models (LLMs) are sensitive to prompts: a synonymous rephrasing or a typo can lead to unexpected results. Composing an optimal prompt for a specific demand lacks theoretical support and relies entirely on human experimentation, which poses a considerable obstacle to popularizing generative artificial intelligence. However, there has been no systematic analysis of how stably LLMs resist prompt perturbations in real-world scenarios. In this work, we propose to evaluate the ease-of-use of LLMs and construct E-Bench, which simulates how humans actually use LLMs through synonymous perturbation (paraphrasing, simplification, and colloquialism) and typographical perturbation (such as typing errors). On this basis, we also discuss the combination of these two types of perturbation and analyze the main reasons for performance degradation. Experimental results indicate that although ease-of-use improves significantly as model size increases, there is still a long way to go to build a sufficiently user-friendly model.
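
To make the idea of a typographical perturbation concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure used to build E-Bench: the function names `typo_perturb` and `performance_drop`, the adjacent-letter swap as the noise model, and the `rate` and `seed` parameters are all hypothetical. The sketch perturbs a prompt with keyboard-style transpositions and reports how much a score (for example, a win rate against a reference model) falls after perturbation.

```python
import random


def typo_perturb(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters with probability `rate`.

    Hypothetical noise model: a stand-in for human typing errors,
    not the perturbation procedure described in the paper.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


def performance_drop(original_score: float, perturbed_score: float) -> float:
    """Absolute drop in a score after perturbation (positive means worse)."""
    return original_score - perturbed_score
```

In use, one would score a model on the original prompts and on `typo_perturb(prompt)` for each prompt with the same judge, then pass the two aggregate scores to `performance_drop`. A seeded generator is used so that the same perturbed prompt set can be regenerated for every model being compared.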

Paper Structure

This paper contains 13 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The prompt perturbations in E-Bench, which simulate the actual situations of humans using LLMs.
  • Figure 2: The error analysis of Llama 2-chat 7b. We also provide the amount that improves after perturbation as a reference for evaluation "fluctuation".
  • Figure 3: Attention-head view of Llama 2-chat 7b on a paraphrasing case. The left panel shows the attention of the keyword "address" in the original input, and the right panel shows the attention after perturbation by paraphrasing.
  • Figure 4: Attention-head view of Llama 2-chat 7b on a colloquialism case. The left panel shows the attention of the keyword "cookies" in the original input, and the right panel shows the attention after perturbation by colloquialism.
  • Figure 5: Attention-head view of Llama 2-chat 7b on a typing attack case. The left and right panels show the overall attention over the input before and after perturbation by the typing attack, respectively.