CERD: A Comprehensive Chinese Rhetoric Dataset for Rhetorical Understanding and Generation in Essays

Nuowei Liu, Xinhao Chen, Hongyi Wu, Changzhi Sun, Man Lan, Yuanbin Wu, Xiaopeng Bai, Shaoguang Mao, Yan Xia

TL;DR

CERD is a manually annotated, comprehensive Chinese rhetoric dataset with five interrelated sub-tasks. It supports understanding various rhetorical devices, recognizing the corresponding rhetorical components, and generating rhetorical sentences under given conditions, which can help writers improve their writing proficiency and language use.

Abstract

Existing datasets and corpora for rhetorical understanding and generation focus primarily on single coarse-grained or fine-grained categories, and they treat rhetorical devices as independent sub-tasks, neglecting the interrelations between them. In this paper, we propose the Chinese Essay Rhetoric Dataset (CERD), which covers 4 commonly used coarse-grained categories (metaphor, personification, hyperbole, and parallelism) and 23 fine-grained categories across both form and content levels. CERD is a manually annotated and comprehensive Chinese rhetoric dataset with five interrelated sub-tasks. Unlike previous work, our dataset supports understanding various rhetorical devices, recognizing the corresponding rhetorical components, and generating rhetorical sentences under given conditions, which can help writers improve their writing proficiency and language use. We conduct extensive experiments to demonstrate the interrelations between the tasks in CERD and to establish a benchmark for future research on rhetoric. The experimental results indicate that Large Language Models achieve the best performance across most tasks, and that joint fine-tuning on multiple tasks further improves performance.

Paper Structure

This paper contains 41 sections, 1 equation, 13 figures, and 15 tables.

Figures (13)

  • Figure 1: An excerpt from an essay illustrating four commonly used rhetorical devices. Note that a sentence can employ one or more rhetorical devices, or it can be literal.
  • Figure 2: An example of the five sub-tasks in CERD. An overview of the five tasks is given in Section \ref{sec:exp-view}.
  • Figure 3: Distribution of fine-grained categories: (a) form-level categories and (b) content-level categories.
  • Figure 4: Case study on the Rhetoric Classification, Form Classification, and Content Classification tasks. A mismatched mapping refers to a predicted fine-grained category that does not belong to the predicted coarse-grained category.
  • Figure 5: Case study on the Component Extraction task. A mismatched mapping refers to extracted rhetorical components that do not fully satisfy the requirements of the predicted fine-grained form-level category.
  • ...and 8 more figures