Surprisal reveals diversity gaps in image captioning and different scorers change the story

Nikolai Ilinykh, Simon Dobnik

TL;DR

This work addresses how to quantify linguistic diversity in image captioning by introducing surprisal variance as a diversity metric. It presents a dual-scorer framework, using (i) a caption-domain bi-/tri-gram LM and (ii) a general-domain GPT-2, to compute token-level surprisal $I(w_t) = - \log P_{\theta}(w_t \mid w_{<t})$ and compare human versus model captions on the MSCOCO Karpathy test split. The key finding is scorer-dependent: under the in-domain scorer, humans show about twice the surprisal variance of five state-of-the-art vision-language models; under the general-domain scorer, this pattern reverses, highlighting the complementarity of scoring perspectives. The study argues that future benchmarking must report surprisal-based diversity across multiple scorers to avoid misleading conclusions about human versus model variability and to better capture pragmatic diversity in captioning.

Abstract

We quantify linguistic diversity in image captioning with surprisal variance, the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; robust diversity evaluation must therefore report surprisal under several scorers.
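To make the metric concrete, the following is a minimal sketch of surprisal variance under a caption-trained bigram scorer. It is not the authors' implementation. The add-one smoothing, whitespace tokenisation, natural-log units, pooling of all tokens in the caption set before taking the variance, and the toy captions are all assumptions made for illustration, as are the function names.

```python
import math
from collections import Counter
from statistics import pvariance

BOS = "<s>"  # hypothetical sentence-start symbol


def train_bigram(corpus):
    """Collect context and bigram counts from whitespace-tokenised captions."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for caption in corpus:
        tokens = [BOS] + caption.lower().split()
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def token_surprisals(caption, model):
    """Return I(w_t) = -log P(w_t | w_{t-1}) for each token, in nats."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # assumption: one extra slot reserves mass for unseen words
    tokens = [BOS] + caption.lower().split()
    surprisals = []
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing; a swapped-in choice, not from the paper.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        surprisals.append(-math.log(p))
    return surprisals


def surprisal_variance(captions, model):
    """Population variance of token-level surprisal pooled over a caption set."""
    values = [s for c in captions for s in token_surprisals(c, model)]
    return pvariance(values)


# Toy data, invented for illustration only.
train_captions = ["a dog runs", "a dog sits", "a cat sits"]
model = train_bigram(train_captions)

human_set = ["a cat sits", "a spotted dog leaps"]
model_set = ["a dog sits", "a dog runs"]
print("human:", surprisal_variance(human_set, model))
print("model:", surprisal_variance(model_set, model))
```

Swapping in a second scorer would mean replacing `token_surprisals` with a function that reads token log-probabilities from a general-domain language model, leaving `surprisal_variance` unchanged. Comparing the two resulting variances for the same caption sets is the kind of multi-scorer report the abstract argues for.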

Paper Structure

This paper contains 16 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: An example image from MSCOCO test set with one human reference and one model caption.