Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion Models
Sofia Jamil, Bollampalli Areen Reddy, Raghvendra Kumar, Sriparna Saha, K J Joseph, Koustava Goswami
TL;DR
PoemToPixel presents a two-stage pipeline that visualizes poetry by first summarizing each poem, $S_i=f_{summ}(P_i)$, and then extracting its core elements, $E_i=f_{KeyExtraction}(S_i)$, to craft targeted diffusion prompts. The PoeKey algorithm retrieves emotions, visual elements, and themes, which are converted into concise instructions for image generation with the SDXL Turbo diffusion model, with prompt tuning refined through human feedback. Evaluations on PoemSum and MiniPo indicate that combining summarization with key-element extraction yields closer alignment between poems and their images, outperforming baselines on both automatic metrics (ITM/ITC) and human ratings. The work introduces MiniPo, a 1001-item multimodal nursery rhyme dataset, and shows promise for richer artistic representations of poetry, while acknowledging limitations in handling multiple meanings and in language scope, and noting ethical considerations around diffusion-model biases.
Abstract
The task of text-to-image generation has encountered significant challenges when applied to literary works, especially poetry. Poems are a distinct form of literature, with meanings that frequently transcend the literal words. To address this shortcoming, we propose PoemToPixel, a framework designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates prompt tuning into the image generation framework to ensure that the resulting images closely align with the poetic content. In addition, we propose the PoeKey algorithm, which extracts three key elements from poems (emotions, visual elements, and themes) and forms them into instructions that are subsequently provided to a diffusion model for generating corresponding images. Furthermore, to expand the diversity of the poetry dataset across different genres and ages, we introduce MiniPo, a novel multimodal dataset comprising 1001 children's poems and images. Leveraging this dataset alongside PoemSum, we conducted both quantitative and qualitative evaluations of image generation using our PoemToPixel framework. This paper demonstrates the effectiveness of our approach and offers a fresh perspective on generating images from literary sources.
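To make the data flow of the two stages concrete, the following is a minimal Python sketch of summarization followed by key-element extraction and prompt construction. It is an illustration under stated assumptions, not the PoeKey algorithm itself: the function names `f_summ`, `f_key_extraction`, and `build_prompt`, the `KeyElements` container, and the tiny keyword lexicons are all hypothetical stand-ins. The real system would presumably use learned models for both stages, and the resulting prompt would be passed to a diffusion model rather than printed.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical toy lexicons, used only to keep this sketch self-contained.
# They are NOT taken from the paper.
EMOTION_WORDS: Set[str] = {"joy", "sorrow", "fear", "love", "lonely"}
VISUAL_WORDS: Set[str] = {"moon", "river", "star", "forest", "rose"}
THEME_WORDS: Set[str] = {"night", "death", "childhood", "nature", "time"}


@dataclass
class KeyElements:
    """Container for the three element types named in the abstract."""
    emotions: List[str] = field(default_factory=list)
    visuals: List[str] = field(default_factory=list)
    themes: List[str] = field(default_factory=list)


def f_summ(poem: str, max_words: int = 40) -> str:
    """Placeholder for S_i = f_summ(P_i): keeps the first max_words words.

    A real summarizer would be a language model; truncation is an assumption.
    """
    return " ".join(poem.split()[:max_words])


def _pick(tokens: List[str], lexicon: Set[str]) -> List[str]:
    """Return lexicon words found in tokens, in order, without duplicates."""
    found: List[str] = []
    for tok in tokens:
        if tok in lexicon and tok not in found:
            found.append(tok)
    return found


def f_key_extraction(summary: str) -> KeyElements:
    """Placeholder for E_i = f_KeyExtraction(S_i) using keyword lookup."""
    tokens = [w.strip(".,;:!?\"'").lower() for w in summary.split()]
    return KeyElements(
        emotions=_pick(tokens, EMOTION_WORDS),
        visuals=_pick(tokens, VISUAL_WORDS),
        themes=_pick(tokens, THEME_WORDS),
    )


def build_prompt(elements: KeyElements) -> str:
    """Turn the extracted elements into one concise image-generation prompt."""
    if elements.visuals:
        parts = ["Illustration of " + ", ".join(elements.visuals)]
    else:
        parts = ["Illustration"]
    if elements.emotions:
        parts.append("mood: " + ", ".join(elements.emotions))
    if elements.themes:
        parts.append("theme: " + ", ".join(elements.themes))
    return "; ".join(parts)


if __name__ == "__main__":
    poem = "The lonely moon drifts over the river at night, and love remains."
    summary = f_summ(poem)
    elements = f_key_extraction(summary)
    # In the full pipeline this prompt would go to a diffusion model.
    print(build_prompt(elements))
```

Keeping the three element types in separate fields, rather than a single keyword list, lets the prompt template treat them differently (subject, mood, theme), which is one plausible way to turn extracted elements into "concise instructions".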
