LFED: A Literary Fiction Evaluation Dataset for Large Language Models

Linhao Yu, Qun Liu, Deyi Xiong

TL;DR

LFED addresses the challenge of evaluating LLMs on long narrative literature in Chinese by constructing a dedicated dataset of 95 fictions and 1,304 questions across eight categories. It combines a data-sourcing pipeline based on Douban, an eight-category question taxonomy, crowdsourced question generation with expert validation, and zero-shot and few-shot evaluations of multiple state-of-the-art LLMs in Chinese. The findings show that even top models such as ChatGPT struggle with long-fiction comprehension, reaching only about 57% accuracy in the zero-shot setting, which highlights the need for improved long-context reasoning and targeted benchmarks. The dataset, its methodology, and the accompanying analyses are a resource for studying narrative understanding in LLMs and for guiding improvements in long-document QA and evaluation. LFED's public availability lets researchers and practitioners benchmark progress in literary fiction understanding and long-context reasoning.

Abstract

The rapid evolution of large language models (LLMs) has ushered in the need for comprehensive assessments of their performance across various dimensions. In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which aims to evaluate the capability of LLMs in long-fiction comprehension and reasoning. We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. Additionally, we conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel type, number of characters, year of publication) affect LLM performance in evaluations. Through a series of experiments with various state-of-the-art LLMs, we demonstrate that these models face considerable challenges in answering questions about literary fictions, with ChatGPT reaching only 57.08% under the zero-shot setting. The dataset will be publicly available at https://github.com/tjunlp-lab/LFED.git

Paper Structure (14 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: The overall pipeline for collecting questions in LFED.
  • Figure 2: Decision-tree-style illustration of the question taxonomy. Green arrows denote yes, while red arrows indicate no.
  • Figure 3: Results on different novel attributes under the zero- and few-shot settings. The suffixes -sp and -lp in the model names denote short prompt and long prompt, respectively. The left two subfigures show results for different ranges of character numbers under the zero- and few-shot settings, respectively. The right two subfigures show results for different ranges of publication years under the zero- and few-shot settings, respectively.
  • Figure 4: Results on different novel types under the zero- and few-shot settings. The suffixes -sp and -lp in the model names denote short prompt and long prompt, respectively. The top figure shows zero-shot results, while the bottom one shows few-shot results.