The Solution for the CVPR2023 NICE Image Captioning Challenge

Xiangyu Wu, Yi Gao, Hailiang Zhang, Yang Yang, Weili Guo, Jianfeng Lu

TL;DR

The paper tackles zero-shot image captioning for the NICE challenge, where diverse visual concepts and image types make captioning harder than on traditional datasets. It proposes a four-stage OFA-based pipeline (pre-training, coarse-tuning, fine-tuning, and model ensembling) trained on data collected from Laion-5B, combining contrastive learning, a similarity-bucket prompting scheme, and retrieval-augmented templates to improve caption quality and relevance. Key contributions include the unified pre-training and fine-tuning architecture, bucketed prompt control, retrieval of related captions to enrich templates, and ensemble strategies that place the method first on the challenge leaderboard by CIDEr score in both the validation and test phases. The results suggest that combining template-driven prompts with retrieval augmentation and cross-modal alignment gives strong zero-shot performance across domains.

Abstract

In this paper, we present our solution to the New frontiers for Zero-shot Image Captioning Challenge. Unlike traditional image captioning datasets, this challenge covers a wider variety of new visual concepts from many domains (such as COVID-19) as well as various image types (photographs, illustrations, graphics). At the data level, we collect external training data from Laion-5B, a large-scale CLIP-filtered image-text dataset. At the model level, we use OFA, a large-scale vision-language pre-trained model driven by handcrafted templates, to perform the image captioning task. In addition, we introduce contrastive learning in the pre-training stage to align image-text pairs so that the model learns new visual concepts. We then propose a similarity-bucket strategy and incorporate it into the template to push the model toward higher-quality, better-matched captions. Finally, using a retrieval-augmented strategy, we construct a content-rich template containing the top-k most relevant captions from other image-text pairs, which guides the model toward semantically rich captions. Our method ranks first on the leaderboard, achieving CIDEr scores of 105.17 and 325.72 in the validation and test phases, respectively.
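
The two template ideas in the abstract, the similarity bucket and the retrieved top-k captions, can be illustrated with a small sketch. This is not the authors' implementation: the bucket edges, the prompt wording, the toy two-dimensional embeddings, and the function names (`similarity_bucket`, `retrieve_top_k`, `build_prompt`) are all assumptions made for illustration. In a real pipeline the similarity scores and embeddings would presumably come from a CLIP-style encoder.

```python
import math
from bisect import bisect_right

# Hypothetical bucket edges over image-text similarity; the paper's
# actual edges are not given in this summary.
BUCKET_EDGES = [0.2, 0.3, 0.4]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_bucket(score, edges=BUCKET_EDGES):
    """Map a similarity score to a discrete bucket index (0 = least similar)."""
    return bisect_right(edges, score)


def retrieve_top_k(query_emb, bank, k=2):
    """Return the k captions in `bank` whose embeddings are closest to the query.

    `bank` is a list of (caption, embedding) pairs from other image-text pairs.
    """
    ranked = sorted(bank, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]


def build_prompt(bucket, retrieved):
    """Assemble a text template; the wording here is illustrative only."""
    context = " ".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        f"Related captions: {context} "
        f"Similarity bucket: {bucket}. What does the image describe?"
    )


# Toy usage with made-up embeddings.
bank = [
    ("a red bus on a street", [1.0, 0.0]),
    ("a bowl of soup", [0.0, 1.0]),
    ("a double-decker bus in London", [0.9, 0.1]),
]
captions = retrieve_top_k([1.0, 0.05], bank, k=2)
prompt = build_prompt(similarity_bucket(0.35), captions)
```

Under this reading, each training pair would be tagged with the bucket of its own image-text similarity, and at inference time the prompt would request the highest bucket so that the model aims for a closely matching caption. That usage is an interpretation of the abstract, not a confirmed detail of the method.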

Paper Structure (15 sections, 1 equation, 4 figures, 1 table)

Figures (4)

  • Figure 1: (a): NoCaps dataset, which mostly contains common objects such as animals, plants, and furniture. (b)(c): NICE Challenge dataset, which includes many novel visual concepts and various image types, such as famous historic and cultural subjects and graphics.
  • Figure 2: Overall Architecture. Our solution consists of four main stages, which include Pre-training, Coarse-tuning, Fine-tuning, and Model-ensemble. The training data for the first three stages are all collected from the large-scale Laion-5B dataset.
  • Figure 3: Similarity-bucket is utilized in pre-training, coarse-tuning, and fine-tuning stages.
  • Figure 4: Similarity-bucket is utilized in pre-training, coarse-tuning, and fine-tuning stages.