Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion Models

Sofia Jamil, Bollampalli Areen Reddy, Raghvendra Kumar, Sriparna Saha, K J Joseph, Koustava Goswami

TL;DR

PoemToPixel is a two-stage pipeline for visualizing poetry: it first summarizes each poem $P_i$ into a summary $S_i=f_{summ}(P_i)$, then extracts core elements $E_i=f_{KeyExtraction}(S_i)$ to craft targeted diffusion prompts. The PoeKey algorithm retrieves emotions, visual elements, and themes, which are converted into concise instructions for image generation with the SDXL Turbo diffusion model; the prompts are tuned using human feedback. Evaluations on PoemSum and MiniPo show that combining summarization with key-element extraction yields better alignment between poems and their images, outperforming baselines on both automatic metrics (ITM/ITC) and human ratings. The work also introduces MiniPo, a multimodal dataset of 1001 nursery rhymes with images, and shows promise for richer artistic representations of poetry, while acknowledging limitations in handling multiple meanings and in language scope, and noting ethical considerations around diffusion-model biases.

Abstract

Text-to-image generation encounters significant challenges when applied to literary works, especially poetry. Poems are a distinct form of literature, with meanings that frequently transcend the literal words. To address this shortcoming, we propose the PoemToPixel framework, designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates prompt tuning into the image generation framework to ensure that the resulting images closely align with the poetic content. In addition, we propose the PoeKey algorithm, which extracts three key elements from poems (emotions, visual elements, and themes) and forms instructions from them that are then provided to a diffusion model to generate the corresponding images. Furthermore, to expand the diversity of poetry datasets across genres and ages, we introduce MiniPo, a novel multimodal dataset comprising 1001 children's poems and images. Using this dataset alongside PoemSum, we conduct both quantitative and qualitative evaluations of image generation with our PoemToPixel framework. This paper demonstrates the effectiveness of our approach and offers a fresh perspective on generating images from literary sources.
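
To make the data flow concrete, the two stages described above, $S_i=f_{summ}(P_i)$ followed by $E_i=f_{KeyExtraction}(S_i)$, can be sketched as a small composition of functions. This is only an illustrative sketch, not the authors' implementation: the names `KeyElements`, `build_prompt`, and `poem_to_prompt` are hypothetical, the summarizer and extractor are passed in as plain callables (in the paper these roles are filled by prompted models), and the prompt template is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KeyElements:
    """Hypothetical container for the three element groups PoeKey extracts."""
    emotions: List[str]
    visual_elements: List[str]
    themes: List[str]


def build_prompt(elements: KeyElements) -> str:
    """Join the extracted elements into one concise image instruction.

    The template below is an assumption made for illustration only.
    """
    parts = [
        "Visual elements: " + ", ".join(elements.visual_elements),
        "Emotions: " + ", ".join(elements.emotions),
        "Themes: " + ", ".join(elements.themes),
    ]
    return "An illustration of a poem. " + ". ".join(parts) + "."


def poem_to_prompt(
    poem: str,
    f_summ: Callable[[str], str],
    f_key_extraction: Callable[[str], KeyElements],
) -> str:
    """Run the two stages in order and return the image instruction."""
    summary = f_summ(poem)                # S_i = f_summ(P_i)
    elements = f_key_extraction(summary)  # E_i = f_KeyExtraction(S_i)
    return build_prompt(elements)


if __name__ == "__main__":
    # Toy stand-ins for the summarizer and extractor (hypothetical).
    toy_elements = KeyElements(
        emotions=["joy"],
        visual_elements=["star", "night sky"],
        themes=["wonder"],
    )
    print(
        poem_to_prompt(
            "Twinkle, twinkle, little star",
            f_summ=lambda poem: poem,
            f_key_extraction=lambda summary: toy_elements,
        )
    )
```

The returned string is what would be handed to the diffusion model as the image instruction. Keeping the two stages as injected callables makes it easy to swap in a different summarizer or extractor without touching the prompt construction.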

Paper Structure

This paper contains 22 sections, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: A framework of an iterative process with prompts refined based on feedback for improved summarization.
  • Figure 2: A framework of an iterative process of image prompts refined based on feedback.
  • Figure 3: An instance of image generation phases during instruction tuning.
  • Figure 4: A comparison of poem-to-image generation using the SDXL Base diffusion model with different methods and the PoemToPixel approach on (a) the PoemSum dataset and (b) the MiniPo dataset (a collection of nursery rhymes).
  • Figure 5: A comparison of poem-to-image generation using the SDXL Base diffusion model for the poem in Table \ref{minipo_poems_only} with different methods and the PoemToPixel approach.
  • ...and 2 more figures