Describing Images $\textit{Fast and Slow}$: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes

Ece Takmaz, Sandro Pezzelle, Raquel Fernández

TL;DR

The study investigates how image properties influence human visuo-linguistic signals during image description and whether pretrained vision encoders can capture this variation. Using the Dutch DIDEC corpus with concurrent eye-tracking, it quantifies variation across speech onsets, starting points, descriptions, and gaze, revealing significant correlations among these signals. A similarity-based probing approach with CLIP, ViT, and RandCLIP shows that pretrained encoders capture some image-driven variation but only weakly to moderately, with stronger signals aligned to training objectives. These findings highlight gaps in current multimodal models and motivate incorporating human signals into data collection and model training to better align machine outputs with human processing.

Abstract

There is an intricate relation between the properties of an image and how humans behave while describing the image. This behavior shows ample variation, as manifested in human signals such as eye movements and when humans start to describe the image. Despite the value of such signals of visuo-linguistic variation, they are virtually disregarded in the training of current pretrained models, which motivates further investigation. Using a corpus of Dutch image descriptions with concurrently collected eye-tracking data, we explore the nature of the variation in visuo-linguistic signals, and find that they correlate with each other. Given this result, we hypothesize that variation stems partly from the properties of the images, and explore whether image representations encoded by pretrained vision encoders can capture such variation. Our results indicate that pretrained models do so to a weak-to-moderate degree, suggesting that the models lack biases about what makes a stimulus complex for humans and what leads to variations in human outputs.
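The summary above mentions a similarity-based probing approach but does not spell out the procedure. As a rough illustration only, the sketch below shows one hypothetical way to probe whether image embeddings carry information about a per-image variation score: each image's score is predicted from the scores of its most similar other images. The function names, the choice of cosine similarity, the leave-one-out nearest-neighbour scheme, and the toy numbers are all assumptions made for this example, not the authors' method or data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def leave_one_out_predictions(embeddings, scores, k=1):
    """Predict each image's score as the mean score of its k most
    similar *other* images (hypothetical probing scheme)."""
    predictions = []
    for i, target in enumerate(embeddings):
        # Rank all other images by similarity to the held-out image.
        neighbours = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(target, embeddings[j]),
            reverse=True,
        )[:k]
        predictions.append(sum(scores[j] for j in neighbours) / len(neighbours))
    return predictions


# Toy 2-d "embeddings" and variation scores, invented for illustration.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_scores = [0.2, 0.3, 0.8, 0.9]

toy_predictions = leave_one_out_predictions(toy_embeddings, toy_scores, k=1)
```

Under this reading, the correlation between `toy_predictions` and `toy_scores` would indicate how much of the variation is recoverable from the embeddings; real encoders such as CLIP or ViT would supply the vectors in place of the toy values.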

Paper Structure (37 sections, 25 figures, 4 tables)

Figures (25)

  • Figure 2: An image with its variation scores, a subset of its descriptions (along with the English translations in parentheses), and the eye movements of a single participant. In the descriptions, the words in boldface indicate the starting points in Dutch and their equivalents in English.
  • Figure 3: Spearman's correlation coefficients between the mean onsets per image (Onset), the variation in starting points (Starting), BLEU-2-based variation in full descriptions (Description), and the variation in gaze (Gaze) in the full dataset. Because higher BLEU scores indicate less variation, unlike the other measures, we report $1 - \text{BLEU}$ for better interpretability. All correlations are significant ($p < .001$).
  • Figure 7: Distributions of onset means and SDs for the images in the whole dataset.
  • Figure 8: Correlation between mean onset and BLEU-2.
  • Figure : Min: 1.69 sec
  • ...and 20 more figures
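
The Figure 3 caption describes Spearman correlations between per-image signals, with BLEU-based scores flipped to $1 - \text{BLEU}$ so that higher values mean more variation for every measure. The following is a minimal pure-Python sketch of that computation. The helper names and the per-image numbers are invented for illustration and are not taken from the paper or the DIDEC corpus.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-image values (assumed, not real data).
mean_onset = [1.9, 2.4, 3.1, 2.2, 3.6]      # mean speech onset in seconds
bleu2 = [0.62, 0.48, 0.30, 0.55, 0.21]      # agreement among descriptions

# Flip BLEU so that higher values mean more variation, as in the caption.
description_variation = [1 - b for b in bleu2]

rho = spearman(mean_onset, description_variation)
```

In this toy data the two rankings coincide, so `rho` comes out at 1.0, while correlating `mean_onset` with the unflipped `bleu2` gives the same magnitude with a negative sign. That sign flip is the interpretability issue the caption addresses.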