Poetry2Image: An Iterative Correction Framework for Images Generated from Chinese Classical Poetry

Jing Jiang, Yiran Ling, Binzhu Li, Pengxiang Li, Junming Piao, Yu Zhang

TL;DR

The paper tackles semantic misalignment and element loss when generating images from Chinese classical poetry by introducing Poetry2Image, a training-free iterative correction framework that combines external poetry data, LLM-based key element extraction, and automated image editing. It forms a closed loop in which initial images generated from poem translations are refined through Open Vocabulary Detector feedback and LLM-guided edits, without model fine-tuning. On a test set of 200 poetry sentences across five image-generation models, the method reaches an average element completeness of 70.63% (up 25.56% over direct generation) and an average semantic consistency of 80.09%, with notable improvements for several models (e.g., DALL-E). The approach is model-agnostic, reduces manual annotation, and provides a reference for non-fine-tuning enhancements to LLM-driven generation, with demonstrated applicability to multilingual poetry in preliminary tests.

Abstract

Text-to-image generation models often struggle with key element loss or semantic confusion in tasks involving Chinese classical poetry. Addressing this issue by fine-tuning models incurs considerable training costs, and manually writing prompts for re-diffusion adjustments requires professional knowledge. To solve this problem, we propose Poetry2Image, an iterative correction framework for images generated from Chinese classical poetry. Using an external poetry dataset, Poetry2Image establishes an automated feedback and correction loop that improves the alignment between poetry and image through image generation models and subsequent re-diffusion modifications suggested by large language models (LLMs). On a test set of 200 sentences of Chinese classical poetry, the proposed method, when integrated with five popular image generation models, achieves an average element completeness of 70.63%, an improvement of 25.56% over direct image generation. In tests of semantic correctness, our method attains an average semantic consistency of 80.09%. The study not only promotes the dissemination of ancient poetry culture but also offers a reference for similar non-fine-tuning methods to enhance LLM generation.

Paper Structure (20 sections, 1 equation, 7 figures, 7 tables, 2 algorithms)

This paper contains 20 sections, 1 equation, 7 figures, 7 tables, 2 algorithms.

Figures (7)

  • Figure 1: Direct text-based image generation often results in losing key elements in the image. Our method addresses this issue by implementing targeted image corrections, effectively capturing the semantics and artistic essence conveyed by the poem.
  • Figure 2: Automated iterative correction framework for images generated from poetry. Using a pre-built poetry dataset, the process begins by retrieving the poem and generating an initial image, then enters a self-feedback image correction loop. In each iteration, an LLM analyzes the semantics of the poem text together with the image elements identified by an Open Vocabulary Detector (OVD). It then outputs correction suggestions that guide the diffusion models for image editing, continuously providing feedback to progressively align the text semantics with the image semantics.
  • Figure 3: An illustration of the LLM Extractor, a key element extraction module. Upon retrieving the poem's translation and critical appreciation from the poetry database, these texts along with the system prompt are fed into the LLM. Subsequently, the LLM outputs the key elements contained in the poetry.
  • Figure 4: An example of the LLM Suggester, a module dedicated to modifying image bounding boxes. After OVD-based element recognition determines the existing bounding boxes, the translation, these bounding boxes, and the system prompt are fed into the LLM. The LLM then adjusts the bounding boxes based on the semantic information in the translation and outputs the modified bounding boxes.
  • Figure 5: Image generation results from the whole-process evaluation. Poetry2Image enhances image generation quality for specialized texts like classical poetry and addresses core issues such as morpheme loss and semantic confusion. The poems corresponding to the images can be found in Appendix B.
  • ...and 2 more figures
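
The closed loop described in the captions for Figures 2 to 4 (detect elements, compare them with the poem's key elements, ask for revised bounding boxes, edit the image, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Image`, `correction_loop`, `toy_detect`, `toy_suggest`, and `toy_edit` are hypothetical, the detector, LLM Suggester, and diffusion editor are replaced by toy stand-ins, and the completeness score (share of key elements found by the detector) is an assumed definition that the text above does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumed layout format: (x, y, width, height). The source does not specify one.
Box = Tuple[int, int, int, int]


@dataclass
class Image:
    """Toy stand-in for a generated image: only the elements it shows and their boxes."""
    boxes: Dict[str, Box] = field(default_factory=dict)


def element_completeness(key_elements: List[str], detected: Dict[str, Box]) -> float:
    """Share of key elements found by the detector (assumed definition)."""
    if not key_elements:
        return 1.0
    return sum(e in detected for e in key_elements) / len(key_elements)


def correction_loop(
    image: Image,
    key_elements: List[str],
    detect: Callable[[Image], Dict[str, Box]],
    suggest: Callable[[Dict[str, Box], List[str]], Dict[str, Box]],
    edit: Callable[[Image, Dict[str, Box]], Image],
    max_iters: int = 3,
) -> Tuple[Image, List[float]]:
    """Detect -> compare -> suggest -> edit, until nothing is missing or the budget runs out."""
    scores: List[float] = []
    for step in range(max_iters + 1):
        detected = detect(image)  # stands in for the Open Vocabulary Detector
        scores.append(element_completeness(key_elements, detected))
        missing = [e for e in key_elements if e not in detected]
        if not missing or step == max_iters:
            break
        target_boxes = suggest(detected, missing)  # stands in for the LLM Suggester
        image = edit(image, target_boxes)          # stands in for diffusion-based editing
    return image, scores


# --- Toy stand-ins so the sketch runs without any model ---------------------

def toy_detect(image: Image) -> Dict[str, Box]:
    return dict(image.boxes)


def toy_suggest(detected: Dict[str, Box], missing: List[str]) -> Dict[str, Box]:
    # Add one missing element per round to mimic gradual correction.
    boxes = dict(detected)
    offset = 10 * len(boxes)
    boxes[missing[0]] = (offset, offset, 32, 32)
    return boxes


def toy_edit(image: Image, target_boxes: Dict[str, Box]) -> Image:
    return Image(boxes=dict(target_boxes))


key_elements = ["moon", "boat", "river"]          # would come from the LLM Extractor
initial_image = Image(boxes={"river": (0, 0, 64, 32)})
final_image, scores = correction_loop(
    initial_image, key_elements, toy_detect, toy_suggest, toy_edit
)
print(scores)
```

Passing the detector, suggester, and editor as callables mirrors the model-agnostic framing in the TL;DR: any image generation or editing backend could be swapped in without changing the loop itself.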