PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs
Zhan Qu, Shuzhou Yuan, Michael Färber
TL;DR
This work tackles the constrained generation of Songci, a classical Chinese poetry form whose Cipai templates impose strict structural, tonal, and rhyme rules. It introduces PoeTone, a complete pipeline comprising a Cipai constraint resource, a thematic canonical corpus, and a multi-faceted evaluation protocol that combines formal conformity scoring, automated quality assessment, human judgment, and classification-based probing. The authors benchmark 18 LLMs across five prompting strategies and propose a Generate-Critic architecture in which automated rule-based feedback selects the training data for fine-tuning, yielding improvements of up to 5.88% in formal conformity for open-source models. The study characterizes the strengths and limitations of current LLMs on culturally significant, formally constrained text and offers a scalable approach for aligning models with symbolic, rule-based objectives in structured domains.
Abstract
This paper presents a systematic investigation into the constrained generation capabilities of large language models (LLMs) in producing Songci, a classical Chinese poetry form characterized by strict structural, tonal, and rhyme constraints defined by Cipai templates. We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks. Using this framework, we evaluate the generative performance of 18 LLMs, including 3 proprietary models and 15 open-source models across 4 families, under five prompting strategies: zero-shot, one-shot, completion-based, instruction-based, and chain-of-thought. Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic. Leveraging the critic's feedback as a scoring function for best-of-N selection, we fine-tune 3 lightweight open-source LLMs via supervised fine-tuning (SFT), resulting in improvements of up to 5.88% in formal conformity. Our findings offer new insights into the generative strengths and limitations of LLMs in producing culturally significant and formally constrained literary texts.
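To make the Generate-Critic idea concrete, the following is a minimal sketch of how a rule-based critic could score candidates and drive best-of-N selection. It is an illustration under stated assumptions, not the authors' implementation: `ToyTemplate`, `conformity_score`, `best_of_n`, and the `rhyme_group` lookup are hypothetical names, the template checks only line count, line lengths, and end rhyme, and tonal constraints are omitted entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ToyTemplate:
    """Hypothetical stand-in for a Cipai template (not the paper's resource)."""
    line_lengths: List[int]  # required number of characters in each line
    rhyme_lines: List[int]   # indices of lines whose final characters must rhyme


def conformity_score(poem: Sequence[str], template: ToyTemplate,
                     rhyme_group: Dict[str, str]) -> float:
    """Return the fraction of satisfied checks (assumed scoring rule)."""
    # Check 1: the poem has exactly as many lines as the template.
    checks = [len(poem) == len(template.line_lengths)]
    # One structural check per template line: the line exists with the right length.
    for i, expected in enumerate(template.line_lengths):
        checks.append(i < len(poem) and len(poem[i]) == expected)
    # One rhyme check per rhyming line: its final character must belong to
    # the same rhyme group as the first rhyming line's final character.
    groups = [
        rhyme_group.get(poem[i][-1]) if i < len(poem) and poem[i] else None
        for i in template.rhyme_lines
    ]
    target = groups[0] if groups else None
    for group in groups:
        checks.append(group is not None and group == target)
    return sum(checks) / len(checks)


def best_of_n(candidates: Sequence[Sequence[str]], template: ToyTemplate,
              rhyme_group: Dict[str, str]) -> Sequence[str]:
    """Pick the candidate the critic scores highest (first one wins ties)."""
    return max(candidates, key=lambda p: conformity_score(p, template, rhyme_group))
```

In a pipeline of this shape, a generator would sample N candidates per prompt, `best_of_n` would keep the highest-scoring one, and the retained prompt and poem pairs would form the supervised fine-tuning set. The actual critic described in the abstract draws on the full evaluation framework, so this single score should be read only as a simplified proxy.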
