Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images

Zimao Lu, Hui Xu, Bing Liu, Ke Wang

TL;DR

This work tackles cross-domain degradation in zero-shot image captioning by introducing Negative Entity Suppression (NES), a unified framework that uses synthetic images for image-to-text retrieval, filters out negative, hallucination-prone entities, and applies attention-level suppression to reduce their influence. By bridging the modality gap with synthetic visuals and mitigating retrieval-induced and language-prior hallucinations, NES preserves in-domain performance while substantially improving cross-domain transfer, achieving new state-of-the-art results on Flickr30k→COCO and strong gains on NoCaps. The contributions include training-time negative-entity classification, a similarity-based inference filter, and a targeted suppression mechanism that reduces hallucinations without sacrificing caption quality. The approach is practical for deploying zero-shot captioning systems across diverse visual domains without costly paired data.

Abstract

Text-only training provides an attractive approach to address data scarcity challenges in zero-shot image captioning (ZIC), avoiding the expense of collecting paired image-text annotations. However, although these approaches perform well within training domains, they suffer from poor cross-domain generalization, often producing hallucinated content when encountering novel visual environments. Retrieval-based methods attempt to mitigate this limitation by leveraging external knowledge, but they can paradoxically exacerbate hallucination when retrieved captions contain entities irrelevant to the inputs. We introduce the concept of negative entities--objects that appear in the generated caption but are absent from the input--and propose Negative Entity Suppression (NES) to tackle this challenge. NES seamlessly integrates three stages: (1) it employs synthetic images to ensure consistent image-to-text retrieval across both training and inference; (2) it filters negative entities from retrieved content to enhance accuracy; and (3) it applies attention-level suppression using identified negative entities to further minimize the impact of hallucination-prone features. Evaluation across multiple benchmarks demonstrates that NES maintains competitive in-domain performance while improving cross-domain transfer and reducing hallucination rates, achieving new state-of-the-art results in ZIC. Our code is available at https://github.com/nidongpinyinme/NESCap.
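
As a rough illustration of stage (2), the training-time split of retrieved entities into positive and negative sets can be sketched as a set-membership test against the entities of the input text. The function name `split_entities` and the plain lowercase string matching are assumptions made for this sketch; the entity extraction and matching in the released code may differ.

```python
def split_entities(retrieved_entities, input_entities):
    """Split retrieved entities into (positive, negative) lists.

    Positive: entity also occurs in the input text.
    Negative: entity occurs only in the retrieved captions.
    Matching is case-insensitive exact string matching (an assumption).
    """
    input_set = {e.lower() for e in input_entities}
    positive, negative = [], []
    seen = set()
    for entity in retrieved_entities:
        key = entity.lower()
        if key in seen:  # skip duplicates across retrieved captions
            continue
        seen.add(key)
        if key in input_set:
            positive.append(key)
        else:
            negative.append(key)
    return positive, negative


# Hypothetical example: entities pooled from retrieved captions vs. input text.
pos, neg = split_entities(["Dog", "frisbee", "dog", "bench"], ["dog", "grass"])
print(pos)  # ['dog']
print(neg)  # ['frisbee', 'bench']
```

Under this reading, `frisbee` and `bench` would be the hallucination-prone candidates passed on to the suppression stage, while `dog` is kept as supporting evidence.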

Paper Structure

This paper contains 22 sections, 4 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Illustration of cross-domain degradation in ZIC models. COCO: training and inference on the COCO dataset; Flickr to COCO: training on Flickr30k and inference on COCO. Correct and incorrect entities are marked in green and red, respectively. Entity-aware models, lacking image-text pairs, tend to generate hallucinated entities with high semantic association to existing entities. Retrieval-based models achieve better in-domain performance but introduce new hallucinations that reduce cross-domain generalization. Our model incorporates identification and suppression modules to effectively mitigate the impact of hallucinated information in retrieved content.
  • Figure 2: Analysis of hallucination patterns in existing ZIC methods. Hallucinated entities are classified as retrieved (originating from retrieved captions) or others across four scenarios. The results demonstrate that retrieval-based approaches such as IFCap suffer from the risk of hallucinations, while our NES method effectively reduces both the overall number of hallucinated entities and those specifically induced by retrieval.
  • Figure 3: Analysis of retrieval performance across different data sources and CLIP encoders on the COCO dataset (normalized). Higher values indicate better performance for accuracy (ACC) and recall (RC), while lower values indicate better performance for average hallucination count per image (AHC) and deduplicated hallucination count (DHC).
  • Figure 4: Overall framework of NES. (a) Training phase: The framework generates synthetic images from input text using diffusion models, then retrieves relevant captions via IR (Image-to-text Retrieval). The SIF (Synthetic Image Fusion) module enhances input text features using synthetic images. The NEF (Negative Entity Filtering) module categorizes retrieved entities into positive and negative sets based on input text entities. The AS (Attention-level Suppression) module first fuses retrieved captions with enhanced input features, then suppresses hallucination-prone features using negative entities. Final features are concatenated with positive entities and fed to GPT-2 for generation. (b) Inference phase: The pipeline directly uses input images for retrieval without SIF enhancement. NEF filtering is performed using image-extracted entities with CLIP similarity. The AS module remains consistent with training.
  • Figure 5: Qualitative results on the Flickr30k-to-COCO test set.
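
The inference-phase steps described in the Figure 4 caption (NEF filtering by similarity, followed by attention-level suppression) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy vectors stand in for CLIP embeddings, the `threshold` and `penalty` values are hypothetical, and subtracting a fixed penalty from attention logits is only one plausible way to realize "attention-level suppression".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_similarity(image_vec, entity_vecs, threshold=0.5):
    """Label each retrieved entity positive or negative by its
    similarity to the image embedding (threshold is hypothetical)."""
    positive, negative = [], []
    for name, vec in entity_vecs.items():
        if cosine(image_vec, vec) >= threshold:
            positive.append(name)
        else:
            negative.append(name)
    return positive, negative


def suppressed_attention(scores, negative_mask, penalty=4.0):
    """Softmax over attention logits after lowering the logits of
    positions tied to negative entities (assumed form of suppression)."""
    adjusted = [s - penalty if is_neg else s
                for s, is_neg in zip(scores, negative_mask)]
    peak = max(adjusted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D vectors standing in for CLIP image and entity embeddings.
image_vec = [1.0, 0.0]
entity_vecs = {"dog": [0.9, 0.1], "frisbee": [0.1, 0.9]}
pos, neg = filter_by_similarity(image_vec, entity_vecs)
print(pos, neg)  # ['dog'] ['frisbee']

# Two keys with equal logits; the second is tied to a negative entity.
weights = suppressed_attention([1.0, 1.0], [False, True])
print(weights[0] > weights[1])  # True
```

A soft penalty is used here instead of a hard mask so that a misclassified entity is down-weighted rather than removed outright; whether NES makes the same choice is not stated in the text above.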