Composition and Deformance: Measuring Imageability with a Text-to-Image Model

Si Wu; David A. Smith

TL;DR

This work addresses the challenge of quantifying imageability beyond isolated words by leveraging text-to-image (T2I) generation. It introduces two metrics, $aveCLIP$ and $imgSim$, derived from the outputs of an open-source T2I model (DALL·E mini), and evaluates them on both word-level data drawn from MRC and sentence-level data drawn from three connected-text corpora (poems, captions, news). The metrics correlate meaningfully with human judgments for isolated words ($r$ around $0.54$ for $aveCLIP$ and $0.43$ for $imgSim$) and show varying but generally informative sensitivity to compositional changes in connected text, with deformances capturing imagery shifts that bag-of-words methods miss. The results underscore the potential of image-based measures for studying imageability and compositionality in NLP, while also highlighting their dependence on dataset type, model training data, and evaluation noise. Future work should compare multiple text-to-image models and expand human judgments to stabilize correlations.

Abstract

Although psycholinguists and psychologists have long studied the tendency of linguistic strings to evoke mental images in hearers or readers, most computational studies have applied this concept of imageability only to isolated words. Using recent developments in text-to-image generation models, such as DALL·E mini, we propose computational methods that use generated images to measure the imageability of both single English words and connected text. We sample text prompts for image generation from three corpora: human-generated image captions, news article sentences, and poem lines. We subject these prompts to different deformances to examine the model's ability to detect changes in imageability caused by compositional change. We find high correlation between the proposed computational measures of imageability and human judgments of individual words. We also find that the proposed measures respond more consistently to changes in compositionality than baseline approaches. We discuss possible effects of model training and implications for the study of compositionality in text-to-image models.
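As a rough illustration of the two measures, the sketch below computes them from precomputed embeddings. It follows the description in the Figure 3 caption: $aveCLIP$ as the average CLIP score over the images generated for a prompt, and $imgSim$ as the average pairwise cosine similarity between the image embeddings. It assumes the CLIP score is the plain cosine similarity between the prompt embedding and an image embedding; the paper may scale or clip this value. The function names `cosine`, `ave_clip`, and `img_sim` are hypothetical, and obtaining the embeddings from a real model is left out.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def ave_clip(text_emb: Vector, image_embs: Sequence[Vector]) -> float:
    """Mean text-image similarity over all images generated for one prompt.

    Assumption: the CLIP score is the unscaled cosine similarity.
    """
    if not image_embs:
        raise ValueError("need at least one image embedding")
    return sum(cosine(text_emb, e) for e in image_embs) / len(image_embs)


def img_sim(image_embs: Sequence[Vector]) -> float:
    """Mean cosine similarity over all unordered pairs of image embeddings."""
    pairs = list(combinations(image_embs, 2))
    if not pairs:
        raise ValueError("need at least two image embeddings")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real model outputs.
    prompt = [1.0, 0.0]
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(ave_clip(prompt, images))  # mean of the two text-image similarities
    print(img_sim(images))           # similarity of the single image pair
```

Under this reading, a prompt whose generated images all resemble one another gets an $imgSim$ close to 1, which matches the intuition in Figure 1 that highly imageable words yield visually homogeneous images.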

Paper Structure (22 sections, 9 figures, 5 tables)

Introduction
Related work
Datasets
Connected text datasets
Psycholinguistics databases
Methods
Model
Measurements
Measuring isolated word's imageability
The case of familiarity of MRC vocabulary
Connected text and compositionality
Preprocessing
Measuring imageability
Deformances
Human judgment
...and 7 more sections

Figures (9)

Figure 1: Generated images of words with high imageability ratings are more visually homogeneous than those of words with low imageability ratings.
Figure 2: The x-axis is the MRC human imageability rating and the y-axis is $imgSim$. Each dot is a word, colored by its $aveCLIP$.
Figure 3: Average CLIP score vs. average pairwise image embedding cosine similarity. Each dot is an MRC word.
Figure 4: Percent change between lines with the top 10% and bottom 10% {aveCLIP, imgSim} scores and their associated deformed text.
Figure 5: The original poem vs. its replaced noun version. Displaying only the changed lines.
...and 4 more figures
