Visually-Aware Context Modeling for News Image Captioning

Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens

TL;DR

The paper tackles News Image Captioning by splitting the visual input into faces, which can be grounded directly in names from the article, and the remaining context, which cannot. It introduces a face naming module with prefix-augmented self-attention, a CLIP-based sentence retrieval strategy that links the image to relevant article segments, and CoLaM, a margin-based training regime that emphasizes article context. Built on a BART-based encoder-decoder, the approach improves on the previous state of the art (without external data) by 7.97/5.80 CIDEr on GoodNews/NYTimes800k, and the gains are supported by comprehensive ablations. The modular framework improves the grounding of captions in both images and articles, which yields more informative, context-aware news captions.

Abstract

News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. We design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image, mimicking the human thought process of linking articles to images. Furthermore, to tackle the problem of the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method, Contrasting with Language Model backbone (CoLaM), to the training pipeline. We conduct extensive experiments to demonstrate the efficacy of our framework. We outperform the previous state of the art (without external data) by 7.97/5.80 CIDEr on GoodNews/NYTimes800k. Our code is available at https://github.com/tingyu215/VACNIC.
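
The retrieval step described above can be illustrated with a minimal sketch. It assumes the image and each article sentence have already been embedded into a shared space (in the paper this is done with CLIP; here the vectors are small hand-written placeholders), and it ranks sentences by cosine similarity to the image. The function names, the value of `k`, and the choice to return the selected sentences in article order are assumptions made for illustration, not details taken from the paper or its released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_sentences(image_emb, sentence_embs, k=2):
    """Return indices of the k sentences most similar to the image.

    Indices are returned in article order (an assumption of this sketch),
    so the retrieved context still reads as a coherent excerpt.
    """
    scored = [(cosine(image_emb, s), i) for i, s in enumerate(sentence_embs)]
    # Highest similarity first; ties broken by earlier position in the article.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(i for _, i in scored[:k])


# Placeholder embeddings standing in for CLIP image/text features.
image_emb = [1.0, 0.0, 0.0]
sentence_embs = [
    [0.0, 1.0, 0.0],  # sentence 0: unrelated to the image
    [0.9, 0.1, 0.0],  # sentence 1: closely related
    [0.5, 0.5, 0.0],  # sentence 2: partly related
    [0.0, 0.0, 1.0],  # sentence 3: unrelated
]

top_idx = retrieve_sentences(image_emb, sentence_embs, k=2)
```

With real CLIP features the same ranking logic applies; only the source of `image_emb` and `sentence_embs` changes.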

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, and 20 tables.

Figures (6)

  • Figure 1: Two types of image captions. The image contains all context needed for the generic image caption, while in the news image caption, we find more named entities, including the name of a celebrity whose face appears in the image, and context that is retrieved from the corresponding news article. Most of the context in the news image caption requires linking the image to the article.
  • Figure 2: Method illustration. Our model is an encoder-decoder model built on BART (middle). Our method consists of: (a) Integrating Features into BART: In the BART encoder, we concatenate visual ($H_V$) and name features ($H_E$) to obtain keys and values for the added cross-attention module; (b) Face Naming Module: We first get the embedding $H_N$ of the chain of person names in the article. Then we prepend the face features $H_F$ to $H_N$ to obtain keys and values for the prefix-augmented self-attention module; (c) CLIP Retrieval: We conduct sentence retrieval using CLIP to learn from more accurate article context; (d) Contrasting with LM backbone (CoLaM): We contrast the multimodal BART with a frozen pure-text BART to force the model to focus more on the article context.
  • Figure 3: Comparison between two types of image captions for the image in Table \ref{tab:qualitative} (1)
  • Figure 4: Comparison between two types of image captions for the image in Table \ref{tab:qualitative} (2)
  • Figure 5: Qualitative comparison with and without Face Naming. Names of correctly grounded persons are marked in green; names of wrongly grounded persons, or of persons who do not appear in the image, are marked in red. Both models use $\text{BART}_{\text{base}}$ as the backbone LM.
  • ...and 1 more figure
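
The prefix-augmented self-attention in the Figure 2(b) caption can be sketched as follows. This is a simplified, single-head illustration under stated assumptions: queries are taken from the name embeddings $H_N$, keys and values are formed by prepending the face features $H_F$ to $H_N$, and learned query/key/value projections are omitted (treated as identity). The function names and toy inputs are hypothetical and are not drawn from the paper's released code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_attention(h_n, h_f):
    """Single-head scaled dot-product attention with a face prefix.

    h_n: list of name embeddings (queries), each of dimension d.
    h_f: list of face features, prepended to h_n to form keys and values.
    Learned projections are omitted in this sketch (identity assumed).
    """
    kv = h_f + h_n  # [faces; names] along the sequence axis
    scale = math.sqrt(len(h_n[0]))
    outputs, weights = [], []
    for q in h_n:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in kv]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append(
            [sum(w_j * v[d] for w_j, v in zip(w, kv)) for d in range(len(q))]
        )
    return outputs, weights


# Toy inputs: one face feature and two name embeddings of dimension 2.
h_f = [[1.0, 0.0]]
h_n = [[1.0, 0.0], [0.0, 1.0]]

outputs, weights = prefix_attention(h_n, h_f)
```

Each name attends over one face position plus two name positions, so every row of `weights` has length 3. In this toy case the first name's embedding matches the face feature, so it attends to the face as strongly as to itself, which is the kind of face-name association the module is meant to learn.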