The Solution for the CVPR2024 NICE Image Captioning Challenge

Longfei Huang; Shupeng Zhong; Xiangyu Wu; Ruoxuan Li

TL;DR

This work tackles zero-shot image captioning for NICE 2024 by addressing style-content gaps in human annotations through retrieval-augmented data and a caption-level strategy built on the OFA framework with handcrafted templates. It introduces a data discovery pipeline using EVA-CLIP and Adaption Re-ranking to assemble high-quality training material from model-generated captions and constructs a mini knowledge base to guide caption formation via retrieved prompts, culminating in a CIDEr-optimized ensemble. The key contributions are the retrieval-augmented fine-tuning and the caption-level control that improve caption quality and alignment with manual-style annotations, achieving a CIDEr score of $234.11$ on NICE 2024. The results underscore the importance of data quality and external knowledge integration for robust zero-shot captioning and suggest avenues for self-iterative improvements without collecting new data.

Abstract

This report introduces a solution to Topic 1, Zero-shot Image Captioning, of NICE 2024: New frontiers for zero-shot Image Captioning Evaluation. In contrast to the NICE 2023 dataset, this challenge involves new human annotations that differ significantly in caption style and content. We therefore enhance image captions through retrieval augmentation and caption grading. At the data level, we use high-quality captions generated by image captioning models as training data to close the gap in text style. At the model level, we employ OFA (a large-scale vision-language pre-training model based on handcrafted templates) to perform the image captioning task. We then propose a caption-level strategy for the high-quality caption data generated by the image captioning models, and integrate it together with the retrieval augmentation strategy into the template, so that the model is compelled to generate higher-quality, better-matching, and semantically richer captions based on the retrieval-augmented prompts. Our approach achieves a CIDEr score of 234.11.
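The way retrieved captions and a caption-level tag might be folded into a template can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `retrieve`, `build_prompt`), the toy two-dimensional embeddings, the level label, and the template wording are all hypothetical. In the report, embeddings come from EVA-CLIP and the template is OFA's handcrafted one.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(image_emb, knowledge_base, k=2):
    """Return the k captions whose embeddings are closest to the image embedding."""
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine(image_emb, entry["emb"]),
        reverse=True,
    )
    return [entry["caption"] for entry in ranked[:k]]


def build_prompt(retrieved, level):
    """Assemble a hypothetical template from retrieved captions and a caption-level tag."""
    hints = "; ".join(retrieved)
    return (
        f"Reference captions: {hints}. "
        f"Caption level: {level}. "
        "What does the image describe?"
    )


# Toy knowledge base; the vectors stand in for real image-text embeddings (assumption).
knowledge_base = [
    {"caption": "a dog on grass", "emb": [1.0, 0.0]},
    {"caption": "a puppy playing", "emb": [0.9, 0.1]},
    {"caption": "a bowl of soup", "emb": [0.0, 1.0]},
]

retrieved = retrieve([1.0, 0.0], knowledge_base, k=2)
prompt = build_prompt(retrieved, level="high")
print(prompt)
```

In this sketch the level tag is set to the best grade at inference time, on the assumption that a model trained with graded captions will then lean toward its highest-quality outputs.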

Paper Structure (15 sections, 4 equations, 5 figures, 4 tables)

Introduction
Related Work
Vision-language Pre-training Models
Image Captioning
Vision-Language Retrieval
Methodology
Overall Architecture
Data discovery
Retrieval-augmented
Caption-level
Model-ensemble
Experiments
Implementation Detail
Main Result
Conclusion

Figures (5)

Figure 1: The web-scraped Shutterstock data exhibits significant differences in text style compared to manually annotated data. In contrast, the model-generated data aligns closely with the stylistic characteristics of manually annotated datasets such as COCO and NICE 2024.
Figure 2: Overall Architecture. Our solution consists of four main stages: Data discovery, Fine-tuning (Retrieval-augmented and Caption-level strategies), and Model-ensemble. All training data are collected from the model-generated dataset.
Figure 3: Establishing a dataset using visual language retrieval with the EVA-CLIP model and Adaption Re-ranking method.
Figure 4: Caption-level is utilized in fine-tuning stages.
Figure 5: A comparison of prediction results between fine-tuning directly on the COCO dataset, fine-tuning on data crawled from the web, and our method. The COCO dataset has a relatively limited knowledge scope, and "COCO fine-tuned" often produces vague descriptions (such as "some food") that fail to accurately predict scenes and object categories. Web-crawled data covers a broader range of knowledge, so "web-crawled fine-tuned" can accurately predict object categories, but it may introduce fabricated information in the form of misleading terms (such as "organic tomatoes"), which diminishes the practicality of the model. In contrast, our approach generates detailed descriptions while avoiding misleading terms, significantly enhancing the model's performance and practicality.
