Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion Models

Sofia Jamil; Bollampalli Areen Reddy; Raghvendra Kumar; Sriparna Saha; K J Joseph; Koustava Goswami

TL;DR

PoemToPixel is a two-stage pipeline for visualizing poetry: it first summarizes each poem, $S_i=f_{summ}(P_i)$, and then extracts core elements, $E_i=f_{KeyExtraction}(S_i)$, to craft targeted diffusion prompts. The PoeKey algorithm extracts emotions, visual elements, and themes, which are converted into concise instructions for image generation with the SDXL Turbo diffusion model; prompt tuning is refined through human feedback. Evaluations on PoemSum and MiniPo show that combining summarization with key-element extraction yields better alignment between poems and their images, outperforming baselines in both automatic (ITM/ITC) and human ratings. The work introduces MiniPo, a 1001-item multimodal nursery rhyme dataset, and shows promise for richer artistic representations of poetry, while acknowledging limitations in handling multiple meanings and language scope and noting ethical considerations around diffusion-model biases.

Abstract

The task of text-to-image generation has encountered significant challenges when applied to literary works, especially poetry. Poems are a distinct form of literature, with meanings that frequently go beyond the literal words. To address this shortcoming, we propose the PoemToPixel framework, designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates prompt tuning into the image generation framework to ensure that the resulting images closely align with the poetic content. In addition, we propose the PoeKey algorithm, which extracts three key elements (emotions, visual elements, and themes) from poems to form instructions that are then provided to a diffusion model for generating the corresponding images. Furthermore, to expand the diversity of the poetry dataset across different genres and ages, we introduce MiniPo, a novel multimodal dataset comprising 1001 children's poems and images. Leveraging this dataset alongside PoemSum, we conducted both quantitative and qualitative evaluations of image generation using our PoemToPixel framework. This paper demonstrates the effectiveness of our approach and offers a fresh perspective on generating images from literary sources.
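
The data flow described above (poem, then summary $S_i$, then key elements $E_i$, then an instruction for the diffusion model) can be sketched in a few lines of Python. This is only an illustrative sketch under loud assumptions: the names `toy_summarize`, `toy_extract`, `build_instruction`, and `poem_to_instruction`, the keyword lexicons, and the instruction wording are all hypothetical and are not taken from the paper. The paper's summarization and PoeKey steps rely on prompt-tuned models, not keyword lookup, and the sketch stops before any diffusion model is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class KeyElements:
    """The three element types the abstract names: emotions, visual elements, themes."""
    emotions: List[str]
    visual_elements: List[str]
    themes: List[str]


# Toy lexicons invented for this sketch only (assumption, not from the paper).
EMOTION_WORDS: Set[str] = {"joy", "sorrow", "lonely", "calm"}
VISUAL_WORDS: Set[str] = {"moon", "river", "star", "garden"}
THEME_WORDS: Set[str] = {"night", "childhood", "nature", "love"}


def toy_summarize(poem: str) -> str:
    """Placeholder for f_summ: merely joins the poem's lines into one line."""
    return " ".join(poem.split())


def toy_extract(summary: str) -> KeyElements:
    """Placeholder for f_KeyExtraction: keeps lexicon words found in the summary."""
    words = [w.strip(".,;:!?").lower() for w in summary.split()]

    def pick(lexicon: Set[str]) -> List[str]:
        found: List[str] = []
        for word in words:
            if word in lexicon and word not in found:
                found.append(word)  # preserve first-occurrence order
        return found

    return KeyElements(pick(EMOTION_WORDS), pick(VISUAL_WORDS), pick(THEME_WORDS))


def build_instruction(elements: KeyElements) -> str:
    """Turn the extracted elements into one concise text instruction."""
    parts: List[str] = []
    if elements.visual_elements:
        parts.append("Depict " + ", ".join(elements.visual_elements))
    if elements.emotions:
        parts.append("mood: " + ", ".join(elements.emotions))
    if elements.themes:
        parts.append("theme: " + ", ".join(elements.themes))
    return "; ".join(parts)


def poem_to_instruction(
    poem: str,
    summarize: Callable[[str], str] = toy_summarize,
    extract: Callable[[str], KeyElements] = toy_extract,
) -> str:
    """Compose the two stages: S = summarize(P), E = extract(S), then build the prompt."""
    summary = summarize(poem)
    elements = extract(summary)
    return build_instruction(elements)
```

Calling `poem_to_instruction("The lonely moon\nshines on the river\nthrough the night.")` returns `"Depict moon, river; mood: lonely; theme: night"`. The `summarize` and `extract` parameters are left pluggable so that the toy stand-ins could be swapped for model-backed functions, and the resulting string would then be passed to whatever text-to-image model is in use.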

Paper Structure (22 sections, 7 figures, 11 tables)

Introduction
Related Works
Corpus
Proposed Methodology
Problem Statement
PoemToPixel Framework
Phase 1: Summarization Module
Key Element Extraction Unit
Instruction Generator
Experiment and Results
Phase 1: Summarization
Phase 2: Image Generation
Conclusion
Limitations
Ethical Consideration
...and 7 more sections

Figures (7)

Figure 1: A framework of an iterative process with prompts refined based on feedback for improved summarization.
Figure 2: A framework of an iterative process of image prompts refined based on feedback.
Figure 3: An instance of image generation phases during instruction tuning.
Figure 4: A comparison of poem-to-image generation using the SDXL Base Diffusion Model with different methods and the PoemToPixel approach on (a) the PoemSum dataset and (b) the MiniPo dataset (a collection of nursery rhymes).
Figure 5: A comparison of poem-to-image generation using the SDXL Base Diffusion Model with different methods and the PoemToPixel approach, for the poem in Table \ref{minipo_poems_only}.
...and 2 more figures
