Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation

Rem Hida; Junki Ohmura; Toshiyuki Sekiya

TL;DR

This paper proposes an automatic evaluation pipeline that uses a machine reading comprehension (MRC) model to determine whether a generated story ending reflects the given instruction, and shows that the proposed metric aligns with human evaluation.

Abstract

Instruction-tuned Large Language Models (LLMs) have achieved remarkable performance across various benchmark tasks. While providing instructions to LLMs to guide their generations is user-friendly, how to assess their instruction-following capabilities remains unclear due to a lack of evaluation metrics. In this paper, we focus on evaluating the instruction-following ability of LLMs in the context of story-ending generation, which requires diverse and context-specific instructions. We propose an automatic evaluation pipeline that uses a machine reading comprehension (MRC) model to determine whether a generated story ending reflects the given instruction. Our findings demonstrate that the proposed metric aligns with human evaluation. Furthermore, our experiments confirm that recent open-source LLMs can achieve instruction-following performance close to that of GPT-3.5, as assessed through automatic evaluation.
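To make the idea of an MRC-based check concrete, here is a minimal sketch under stated assumptions. It assumes a multiple-choice formulation in which a scorer rates each candidate ending against the instruction, and the generated ending counts as following the instruction only if it outscores every distractor. This formulation, the names `MRCScorer`, `follows_instruction`, and `instruction_following_rate`, and the word-overlap `toy_scorer` are all hypothetical illustrations, not the paper's implementation; in practice a trained MRC model would replace the toy scorer.

```python
import re
from typing import Callable, Sequence, Tuple

# Hypothetical interface (an assumption, not taken from the paper):
# a scorer rates how well `option` satisfies `instruction` given `context`.
MRCScorer = Callable[[str, str, str], float]
Example = Tuple[str, str, str, Sequence[str]]  # context, instruction, generated, distractors


def _tokens(text: str) -> set:
    """Lowercase word set used by the toy scorer."""
    return set(re.findall(r"[a-z']+", text.lower()))


def toy_scorer(context: str, instruction: str, option: str) -> float:
    """Stand-in for a real MRC model: counts words shared by instruction and option."""
    return float(len(_tokens(instruction) & _tokens(option)))


def follows_instruction(scorer: MRCScorer, context: str, instruction: str,
                        generated: str, distractors: Sequence[str]) -> bool:
    """True if the generated ending strictly outscores every distractor ending."""
    generated_score = scorer(context, instruction, generated)
    return all(generated_score > scorer(context, instruction, d) for d in distractors)


def instruction_following_rate(scorer: MRCScorer, examples: Sequence[Example]) -> float:
    """Fraction of examples whose generated ending is judged to follow the instruction."""
    if not examples:
        return 0.0
    hits = sum(follows_instruction(scorer, *example) for example in examples)
    return hits / len(examples)


context = "Tom found a wallet on the street."
instruction = "Write an ending where Tom returns the wallet to its owner."
examples = [
    # Generated ending matches the instruction.
    (context, instruction,
     "Tom returns the wallet to its owner and feels proud.",
     ["Tom buys a new bike with the cash.", "Tom forgets about it and goes home."]),
    # Generated ending ignores the instruction.
    (context, instruction,
     "Tom buys a new bike with the cash.",
     ["Tom returns the wallet to its owner."]),
]
print(instruction_following_rate(toy_scorer, examples))
```

Keeping the scorer behind a plain callable means the aggregation logic stays the same whether the score comes from this toy overlap count or from an actual MRC model.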

Paper Structure (23 sections, 2 equations, 4 figures, 8 tables)

Introduction
Related Work
Instruction-Following Ability
Application of Instruction-Following Ability
Conditional Story Generation
Instruction-Following Ability on Story-Ending Generation
Task Setting
Dataset
Proposed Metric for Instruction-Following
Evaluation Experiments
Model
Generation Setting
Human Evaluation
Automatic Evaluation
Conclusion
...and 8 more sections

Figures (4)

Figure 1: Overview of instruction-following story-ending generation: conditioning on instruction texts produces different endings, such as dangerous and advisory endings. The examples are from Possible Stories (Ashida and Sugawara, 2022).
Figure 2: Evaluation pipeline of IFSM (Instruction Following Score from the MRC model)
Figure 3: Instruction for Evaluator.
Figure 4: Evaluation Page.
