Surprisal reveals diversity gaps in image captioning and different scorers change the story
Nikolai Ilinykh, Simon Dobnik
TL;DR
This work addresses how to quantify linguistic diversity in image captioning by introducing surprisal variance as a diversity metric. It presents a dual-scorer framework, using (i) a caption-domain bi-/tri-gram LM and (ii) a general-domain GPT-2, to compute token-level surprisal $I(w_t) = - \log P_{\theta}(w_t \mid w_{<t})$ and compare human versus model captions on the MSCOCO Karpathy test split. The key finding is scorer-dependent: under the in-domain scorer, humans show about twice the surprisal variance of five state-of-the-art vision-language models; under the general-domain scorer, this pattern reverses, highlighting the complementarity of scoring perspectives. The study argues that future benchmarking must report surprisal-based diversity across multiple scorers to avoid misleading conclusions about human versus model variability and to better capture pragmatic diversity in captioning.
Abstract
We quantify linguistic diversity in image captioning with surprisal variance, the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces a surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; robust diversity evaluation must therefore report surprisal under several scorers.
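To make the metric concrete, the following is a minimal sketch of surprisal variance under a toy in-domain scorer. It is an illustration under stated assumptions, not the authors' implementation: the add-one-smoothed bigram model, whitespace tokenization, the pooling of all token surprisals across the caption set before taking the variance, and the names `train_bigram`, `token_surprisals`, and `surprisal_variance` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Fit a toy add-one-smoothed bigram model; returns prob(cur, prev).

    Assumption: stands in for a caption-domain n-gram scorer.
    """
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab) + 1  # one extra slot shared by all unseen tokens

    def prob(cur, prev):
        return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)

    return prob


def token_surprisals(caption, prob):
    """Token-level surprisal I(w_t) = -log P(w_t | w_{t-1}), in nats."""
    tokens = ["<s>"] + caption.lower().split()
    return [-math.log(prob(cur, prev)) for prev, cur in zip(tokens, tokens[1:])]


def surprisal_variance(captions, prob):
    """Population variance of token surprisals pooled over a caption set.

    Assumption: pooling across captions is one possible reading of
    "spread within a caption set".
    """
    values = [s for caption in captions for s in token_surprisals(caption, prob)]
    if not values:
        raise ValueError("no tokens to score")
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    scorer = train_bigram(["a dog on a beach", "a cat on a sofa"])
    print(surprisal_variance(["a dog on a beach", "a zebra juggles"], scorer))
```

Because the scorer is passed in as a plain callable, the same caption set can be rescored by swapping `prob` for a function backed by a different language model, which is the comparison the abstract argues should be reported.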
