POEMetric: The Last Stanza of Humanity

Bingru Li, Han Wang, Hazel Wilkinson

Abstract

Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a given form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems in 7 fixed forms, annotated with meter, rhyme patterns, and themes - and prompted 30 LLMs to generate poems with the same forms and themes as the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, although the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as the judge; same below) and theme alignment (4.99), no model matched the advanced abilities of human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also outperformed the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and code are released at https://github.com/Bingru-Li/POEMetric.
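
As a concrete illustration of the rule-based side of this pipeline, the sketch below checks a poem's end rhymes against an annotated scheme such as ABAB. This is a minimal sketch only: the pronouncing package (a wrapper around the CMU Pronouncing Dictionary) and all helper names are assumptions made for illustration, not the released POEMetric implementation.

import pronouncing

def rhyme_part(word: str) -> str | None:
    """Phonemes from the last stressed vowel onward, or None if unknown."""
    phones = pronouncing.phones_for_word(word.lower().strip(".,;:!?"))
    return pronouncing.rhyming_part(phones[0]) if phones else None

def follows_scheme(lines: list[str], scheme: str) -> bool:
    """True if lines that share a scheme letter share a rhyming part."""
    parts = {}
    for line, letter in zip(lines, scheme):
        end = rhyme_part(line.split()[-1])
        if end is None:
            return False  # out-of-vocabulary end word: be conservative
        if parts.setdefault(letter, end) != end:
            return False
    return True

quatrain = ["The day is done,", "and night is deep;",
            "we had our fun,", "now off to sleep."]
print(follows_scheme(quatrain, "ABAB"))  # True

A meter check can be sketched the same way, reading each word's stress markers from the same dictionary and comparing the concatenated pattern against the annotated one.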

Figures (24)

  • Figure 1: POEMetric. It comprises 10 metrics in three categories: 1) basic instruction-following abilities (form accuracy and theme alignment), 2) advanced creative abilities (creativity, lexical diversity, idiosyncrasy, emotional resonance, and use of literary devices and imagery), and 3) general appraisal (overall poem quality and authorship estimation). Human- and LLM-authored poems are compared through rule-based evaluation and LLM-as-a-judge, whose results are validated by human experts.
  • Figure 2: An example of the human poem data and the generation prompt for LLMs. On the left is the data annotated for a poem, including author, title, poem content, source, form, meter pattern, rhyme pattern, theme, and imagery. From these annotations, the generation prompt for LLMs is built with the template on the right.
  • Figure 3: A showcase of the poems by DeepSeek-R1 (Poem A) and a human poet (Poem B) in response to the same prompt. The bar charts show their POEMetric scores judged by Gemini-2.5-Pro.
  • Figure 4: Chain-of-Thought (CoT) process from DeepSeek-R1 for poem generation. The model explicitly breaks down the prompt, plans the thematic progression stanza by stanza, brainstorms meter and rhymes, and attempts to strategically insert literary devices.
  • Figure 5: Rule-based evaluation results. LLMs achieved high form accuracy and MATTR scores; however, their poems were highly repetitive compared with the original human poems (see the MATTR sketch after this list).
  • ...and 19 more figures
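
The lexical diversity measure reported in Figure 5 is MATTR (Moving-Average Type-Token Ratio): the type-token ratio averaged over a sliding window, which is less sensitive to poem length than plain TTR. A minimal sketch, assuming a 50-token window and simple regex tokenization (the released code may differ):

import re

def mattr(text: str, window: int = 50) -> float:
    """Average type-token ratio over all sliding windows of `window` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if len(tokens) < window:  # fall back to plain TTR for short poems
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

# A highly repetitive "poem" scores low, echoing the Figure 5 finding.
print(mattr("the rose is a rose is a rose " * 10))  # 0.08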