CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese \textit{Ci} Poetry

Shangqing Zhao; Yupei Ren; Yuhao Zhou; Xiaopeng Bai; Man Lan

TL;DR

This work introduces CCiV, a benchmark that assesses LLM-generated classical Chinese \textit{Ci} poetry along three dimensions: structure, rhythm, and quality. It shows that form-aware prompting can improve structural and tonal control for stronger models while potentially degrading weaker ones.

Abstract

The generation of classical Chinese \textit{Ci} poetry, a form demanding a sophisticated blend of structural rigidity, rhythmic harmony, and artistic quality, poses a significant challenge for large language models (LLMs). To systematically evaluate and advance this capability, we introduce \textbf{C}hinese \textbf{Ci}pai \textbf{V}ariants (\textbf{CCiV}), a benchmark designed to assess LLM-generated \textit{Ci} poetry across these three dimensions: structure, rhythm, and quality. Our evaluation of 17 LLMs on 30 \textit{Cipai} reveals two critical phenomena: models frequently generate valid but unexpected historical variants of a poetic form, and adherence to tonal patterns is substantially harder than structural rules. We further show that form-aware prompting can improve structural and tonal control for stronger models, while potentially degrading weaker ones. Finally, we observe weak and inconsistent alignment between formal correctness and literary quality in our sample. CCiV highlights the need for variant-aware evaluation and more holistic constrained creative generation methods.
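
To make the idea of variant-aware structural evaluation concrete, the following is a minimal Python sketch, not the benchmark's actual implementation. It assumes that structure can be approximated by the sequence of line lengths obtained by splitting on punctuation. The names `TOY_TEMPLATES`, `line_lengths`, and `match_structure` are hypothetical, and the length patterns are illustrative placeholders rather than data from CCiV.

```python
import re
from typing import Dict, List, Optional

# Hypothetical line-length templates for one Cipai; these are toy
# placeholders for illustration and are NOT taken from the CCiV data.
TOY_TEMPLATES: Dict[str, List[int]] = {
    "standard": [1, 7, 3, 5],
    "variant_a": [1, 7, 3, 3, 2],
}

# Assumption: lines are delimited by common Chinese or ASCII punctuation
# and whitespace.
_SPLIT = re.compile(r"[，。、！？；,.!?;\s]+")


def line_lengths(poem: str) -> List[int]:
    """Return the character count of each punctuation-delimited line."""
    return [len(segment) for segment in _SPLIT.split(poem) if segment]


def match_structure(poem: str, templates: Dict[str, List[int]]) -> Optional[str]:
    """Return the name of the first template whose line lengths match, else None."""
    lengths = line_lengths(poem)
    for name, pattern in templates.items():
        if lengths == pattern:
            return name
    return None


if __name__ == "__main__":
    # Placeholder characters stand in for a generated poem.
    sample = "甲，甲乙丙丁戊己庚。甲乙丙，甲乙丙，甲乙。"
    print(match_structure(sample, TOY_TEMPLATES))
```

Under this sketch, a poem that fails the standard template but matches a listed variant is reported as that variant instead of being counted as a structural error, which is the distinction the abstract draws. Checking tonal patterns would additionally require a tone lexicon and is not attempted here.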

Paper Structure (24 sections, 3 equations, 11 figures, 4 tables)

Introduction
Methodology
Data Collection and Preparation
Experimental Setup
Model Selection
Prompting Conditions
Decoding and Sampling
Evaluation Metrics
Results and Analysis
Baseline Performance and Variant Generation
The Challenge of Tonal Adherence
Improving Control with Form-Aware Prompting
Qualitative Analysis
Discussion
Conclusion
...and 9 more sections

Figures (11)

Figure 1: Prompting strategies for Ci poetry generation: direct prompting versus form-aware prompting with explicit structural guidance.
Figure 2: Prompt template for evaluating Informativeness and Aesthetic metrics.
Figure 3: Heatmap of structural accuracy against standard forms under direct prompting. Each row represents a model (ordered by overall performance), and each column represents a Cipai form (ordered by character count, increasing from 27 on the left to 116 on the right). Darker cells indicate higher accuracy. Two patterns emerge: (1) better performance on shorter forms (left) than on longer forms (right), and (2) stark performance gaps between model families. Anomalous low-accuracy columns (e.g., C27, C28) reflect the variant generation phenomenon.
Figure 4: Direct prompt example (zero-shot). No additional structural guidance is provided. Green text marks an example variable; red text marks an example output.
Figure 5: Form-aware prompt example. Explicit structural guidance is provided. Green text marks an example variable; red text marks an example output.
...and 6 more figures
