TPCap: Unlocking Zero-Shot Image Captioning with Trigger-Augmented and Multi-Modal Purification Modules

Ruoyu Zhang, Lulu Wang, Yi He, Tongling Pan, Zhengtao Yu, Yingna Li

TL;DR

TPCap introduces a retrieval-free image captioning framework that harnesses the zero-shot capabilities of large language models through a trigger-augmented design and a multi-modal purification (MP) module. A two-stage trigger projector aligns visual and textual features, while the MP module refines entity information to reduce noise and hallucination. The model has only $0.82\text{M}$ trainable parameters and is trained on a single RTX 4090. Evaluations on COCO, NoCaps, Flickr30k, and WHOOPS show performance competitive with state-of-the-art methods, including strong open-world reasoning on WHOOPS, which supports the practicality of a lightweight alternative to retrieval-based approaches. The work advances efficient cross-modal alignment by leveraging LLMs without external retrieval banks, offering a low-resource solution for zero-shot image captioning.

Abstract

Recent advancements in large language models (LLMs) have significantly enhanced the fluency and logical coherence of image captioning. Retrieval-Augmented Generation (RAG) is widely adopted to incorporate external knowledge into LLMs; however, existing RAG-based methods rely on separate retrieval banks, introducing computational overhead and limiting the utilization of LLMs' inherent zero-shot capabilities. To address these limitations, we propose TPCap, a novel trigger-augmented and multi-modal purification framework for zero-shot image captioning without external retrieval libraries. TPCap consists of two key components: trigger-augmented (TA) generation and multi-modal purification (MP). The TA module employs a trigger projector with frozen and learnable projections to activate LLMs' contextual reasoning, enhance visual-textual alignment, and mitigate data bias. The MP module further refines the generated entity-related information by filtering noise and enhancing feature quality, ensuring more precise and factually consistent captions. We evaluate TPCap on COCO, NoCaps, Flickr30k, and WHOOPS datasets. With only 0.82M trainable parameters and training on a single NVIDIA RTX 4090 GPU, TPCap achieves competitive performance comparable to state-of-the-art models.

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of different ways of using LLMs to generate image captions. (a) Traditional methods lack detail, making it difficult to generate accurate descriptions. (b) RAG-based methods are limited by the knowledge in the additional retrieval bank. For example, when the retrieval bank covers the Frisbee brand, the generated caption can describe the Frisbee in detail, but the description of the dog is still not detailed enough. (c) Our generation-augmented method replaces the additional retrieval bank with LLMs and generates additional information by activating their zero-shot ability, which helps produce more detailed descriptions.
  • Figure 2: Overview of the proposed TPCap. We introduce a specialized RAG approach and a trigger projector to help the network align visual features with text features and to enhance its zero-shot capability. First, given an image, we extract visual features using a frozen visual encoder and generate visual-language features through a frozen Q-Former. Second, the visual-language features are concatenated with language prompt 1 and projected into the shared dimension by a trigger projector to enhance alignment. Third, the projected features are fed into frozen LLMs to generate coarse-grained information about the entity. Fourth, a multi-modal purification module purifies and refines the coarse-grained entity information and aligns it with the visual-language features. Fifth, the visual-language features, entity features, and language prompt 2 are concatenated and projected into the shared dimension by the trigger projector. Finally, a frozen LLM uses the projected features to generate the output.
  • Figure 3: Three different methods for processing entity-related information: (a) ours, which compresses, purifies, and refines the entity-related information; (b) a feature fusion method that uses cross-attention to fuse information; (c) a feature refinement method that uses cross-attention and learnable tokens to refine the features. Note that, for simplicity, we omit the change in feature dimensions through the linear layer.
  • Figure 4: Five different projector types. (a) L-Projector consists of a single linear layer that projects the input feature dimension to 4096; (b) S-Projector consists of a single shared linear layer, meaning that the projector parameters are shared between trigger-augmented module 1 and trigger-augmented module 2; (c) DL-Projector consists of two linear layers: the first projects the input features from 768 to 1024, and the second projects them from 1024 to 4096; (d) HDL-Projector consists of a learnable linear layer that projects the input feature dimension to 1024 and a frozen linear layer that projects the features from 1024 to 4096; (e) Ours is similar to (d), except that the parameters of the first learnable linear layer are shared. Note that the frozen linear layer requires an input dimension of 1024, while our input feature dimension is fixed at 768, so the frozen linear layer cannot be used on its own.
  • Figure 5: Results on WHOOPS show that our model can reason about commonsense compositionality and can describe illogical images, demonstrating the ability to describe an open world. Blue indicates entities. Red indicates entities whose descriptions are not accurate enough.
  • ...and 1 more figure
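
To make the projector comparison in the Figure 4 caption concrete, the following is a rough sketch that counts the trainable parameters of each variant from the dimensions stated there (768, 1024, 4096). Several things here are assumptions and are not stated in the source: that every linear layer has a bias term, that the unshared variants (L, DL, HDL) keep one separate copy per trigger-augmented module, and the helper names `Linear` and `trainable_params`, which are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    """Hypothetical stand-in for a linear layer; only its shape matters here."""
    in_dim: int
    out_dim: int
    trainable: bool = True

    def num_params(self) -> int:
        # Weight matrix plus bias vector (the bias is an assumption).
        return self.in_dim * self.out_dim + self.out_dim


def trainable_params(layers, copies):
    """Trainable parameters for `copies` independent instances of a projector."""
    return copies * sum(layer.num_params() for layer in layers if layer.trainable)


# Dimensions taken from the Figure 4 caption.
Q_DIM, MID_DIM, LLM_DIM = 768, 1024, 4096

# (layers, number of independent copies across the two trigger-augmented modules).
# Assumption: variants not described as shared use one copy per module.
variants = {
    "L": ([Linear(Q_DIM, LLM_DIM)], 2),
    "S": ([Linear(Q_DIM, LLM_DIM)], 1),
    "DL": ([Linear(Q_DIM, MID_DIM), Linear(MID_DIM, LLM_DIM)], 2),
    "HDL": ([Linear(Q_DIM, MID_DIM),
             Linear(MID_DIM, LLM_DIM, trainable=False)], 2),
    "Ours": ([Linear(Q_DIM, MID_DIM),
              Linear(MID_DIM, LLM_DIM, trainable=False)], 1),
}

counts = {name: trainable_params(layers, copies)
          for name, (layers, copies) in variants.items()}

for name, count in counts.items():
    print(f"{name:>4}: {count:,} trainable parameters")
```

Under these assumptions, the shared learnable 768-to-1024 layer accounts for 787,456 trainable parameters, which is below the 0.82M total reported in the abstract. The source does not give a breakdown of that total, so the sketch should not be read as one.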