EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama

TL;DR

Open-world image captioning requires describing novel objects without costly retraining. EVCap introduces a retrieval-augmented approach that uses an external visual-name memory and a lightweight fusion module to supply object names to a frozen LLM, enabling open-world comprehension with only $3.97\text{M}$ trainable parameters. The memory is built from LVIS images plus synthetic data and is expandable with WHOOPS to cover new objects. Evaluations on COCO, NoCaps, Flickr30k, and WHOOPS show EVCap achieves competitive CIDEr and related metrics with far less training, demonstrating effective open-world adaptation.

Abstract

Large language model (LLM)-based image captioning can describe objects not explicitly observed in training data; yet novel objects occur frequently, so up-to-date object knowledge must be sustained for open-world comprehension. Instead of relying on large amounts of data and/or scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from External Visual-name memory (EVCap). We build ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal cost and (ii) effortlessly augment LLMs with retrieved object names by utilizing a lightweight and fast-to-train model. Our model, which was trained only on the COCO dataset, can adapt to out-of-domain data without requiring additional fine-tuning or re-training. Our experiments conducted on benchmarks and synthetic commonsense-violating data show that EVCap, with only 3.97M trainable parameters, exhibits superior performance compared to other methods based on frozen pre-trained LLMs. Its performance is also competitive with specialist SOTAs that require extensive training.
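The retrieval idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `VisualNameMemory`, its methods, the cosine-similarity ranking, the top-k de-duplication, and the 3-dimensional embeddings are all assumptions made for illustration. It only shows how a memory of (image embedding, object name) pairs might be queried, and why adding a new object needs no retraining.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VisualNameMemory:
    """Hypothetical external memory of (image embedding, object name) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, name):
        # Updating the memory is just an append; no model weights change.
        self.entries.append((list(embedding), name))

    def retrieve(self, query, k=3):
        # Rank entries by similarity to the query embedding, keep the
        # top k, and drop repeated names while preserving rank order.
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)
        names = []
        for _, name in ranked[:k]:
            if name not in names:
                names.append(name)
        return names


# Toy embeddings (assumed); real ones would come from an image encoder.
memory = VisualNameMemory()
memory.add([1.0, 0.0, 0.0], "dog")
memory.add([0.9, 0.1, 0.0], "dog")
memory.add([0.0, 1.0, 0.0], "skateboard")
memory.add([0.0, 0.0, 1.0], "ice skates")

query = [0.8, 0.6, 0.0]
names = memory.retrieve(query, k=3)

# Expanding the memory with a previously unseen object.
memory.add([0.8, 0.6, 0.0], "corgi")
names_after_update = memory.retrieve(query, k=1)
```

In this sketch the retrieved names would then be passed on to the captioning model; how EVCap actually fuses them with visual features is described in the Figure 2 caption.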

Paper Structure

This paper contains 24 sections, 2 equations, 15 figures, and 7 tables.

Figures (15)

  • Figure 1: Overall comparison of our EVCap and SOTAs. (Upper) Captions generated by SmallCap, BLIP-2, and our EVCap for a commonsense-violating image from the WHOOPS dataset. $\times$ and $\checkmark$ indicate incorrect and correct predictions, respectively. Incorrect objects in captions are highlighted in red, while correct ones are in blue. SmallCap and BLIP-2 give incorrect predictions for "ice skates" and "wood floor", respectively, while our EVCap utilizes an external visual-name memory to enhance attention to objects within the image, leading to superior performance for image captioning. (Lower) Comparison of the number of trainable parameters and the CIDEr scores on the COCO and NoCaps datasets. The size of each circle reflects the log number of trainable parameters. EVCap (3.97M) has fewer trainable parameters than the others while achieving results comparable with SOTAs at scale.
  • Figure 2: Schematic of our proposed EVCap. It consists of an external visual-name memory with image embeddings and object names (upper), a frozen ViT and Q-Former equipped with trainable image query tokens, an attentive fusion module built from a customized frozen Q-Former and trainable object name query tokens, and a frozen LLM with a trainable linear layer (lower). The ViT and Q-Former extract learned visual features from the input image, which are then used to retrieve object names from the external memory. These retrieved object names and learned visual features undergo cross-attention in the customized Q-Former, creating refined object name features. Finally, the object name features are combined with the visual features and fed into the LLM after a linear layer to generate captions.
  • Figure 3: Examples of captions generated by our EVCap and three SOTA methods on the COCO test set, NoCaps validation set, and Flickr30k test set. GT refers to the Ground Truth captions. Incorrect objects in captions are highlighted in red, while correct ones are in blue. Our EVCap correctly generates captions across different datasets, showing performance comparable to BLIP-2.
  • Figure 4: Examples of captions generated by our EVCap, EVCap (w/ WHOOPS), and three SOTAs on the WHOOPS dataset. Incorrect objects are highlighted in red, while correct ones are in blue.
  • Figure 5: Visualization of the captions generated in the ablation study on the NoCaps validation set. We also show the object names retrieved by EVCap, presented in gray. Incorrect objects in captions are highlighted in red, while correct ones are in blue.
  • ...and 10 more figures
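
The data flow in the Figure 2 caption (cross-attention between object-name features and visual features, concatenation, then a linear layer before the LLM) can be traced at the shape level with a small sketch. Everything here is an assumption for illustration: a single attention head with no learned projections stands in for the customized Q-Former, object-name features are treated as the queries, and the toy dimensions and the weight matrix `W` are made up. It is not the paper's architecture, only a way to follow the tensor shapes.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def linear(rows, weight):
    """Apply a d_in x d_out weight matrix to each row (bias omitted)."""
    return [[sum(r[i] * weight[i][j] for i in range(len(r)))
             for j in range(len(weight[0]))]
            for r in rows]


# Toy inputs (assumed): 3 visual tokens and 2 object-name tokens, dim 2.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
name_feats = [[2.0, 0.0], [0.0, 2.0]]

# Object-name features attend over the visual features.
refined = cross_attention(name_feats, visual, visual)

# Concatenate refined name features with visual features along the
# token axis, then project to a (toy) LLM embedding size of 3.
fused = refined + visual
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
llm_inputs = linear(fused, W)
```

Here `llm_inputs` holds 5 token vectors of size 3, which is the kind of sequence a frozen LLM would consume as a soft prompt in this simplified picture.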