IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers

Chenglin Yang, Siyuan Qiao, Yuan Cao, Yu Zhang, Tao Zhu, Alan Yuille, Jiahui Yu

TL;DR

This paper narrows the gap between generative captioners and discriminative zero-shot classifiers by mitigating the linguistic priors that pull caption scores away from the visual input. It introduces Information Gain (IG) as the scoring objective used at evaluation, defined by the log-ratio $\log P(T|I) - \log P(T)$, which emphasizes the information contributed by the image, and pairs it with a two-part training objective to produce an IG captioner pretrained on Laion-5B. Empirically, the IG captioner improves zero-shot ImageNet top-1 classification over the standard captioner by roughly 18–19.7% and is competitive with CLIP on ImageNet and on zero-shot image-text retrieval on MSCOCO and Flickr30K, while reducing the gap in retrieval recall. The results show that a purely generative training pipeline can attain strong discriminative capabilities when the evaluation and training objectives explicitly discount text priors, suggesting a path toward unifying generative and discriminative visual-language learning.
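
To see why subtracting $\log P(T)$ discounts the text prior, it may help to expand the IG score with Bayes' rule. The short derivation below is a standard identity (the score equals the pointwise mutual information between image and caption) and is offered as a reading aid, not as a derivation quoted from the paper; the symbol $\mathrm{IG}(I, T)$ is shorthand introduced here.

$$
\mathrm{IG}(I, T) = \log P(T|I) - \log P(T) = \log \frac{P(I, T)}{P(I)\,P(T)} = \log P(I|T) - \log P(I)
$$

For a fixed image $I$, the term $\log P(I)$ is the same for every candidate caption, so ranking captions by $\mathrm{IG}(I, T)$ is equivalent to ranking them by $\log P(I|T)$. How frequent a caption is in text alone therefore no longer affects the ranking.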

Abstract

Generative training has been demonstrated to be powerful for building visual-language models. However, on zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives. In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks, without any finetuning processes or additional modules. Specifically, we focus on narrowing the gap between the generative captioner and the CLIP classifier. We begin by analysing the predictions made by the captioner and classifier and observe that the caption generation inherits the distribution bias from the language model trained with pure text modality, making it less grounded on the visual signal. To tackle this problem, we redesign the scoring objective for the captioner to alleviate the distributional bias and focus on measuring the gain of information brought by the visual inputs. We further design a generative training objective to match the evaluation objective. We name our model trained and evaluated from the novel procedures as Information Gain (IG) captioner. We pretrain the models on the public Laion-5B dataset and perform a series of discriminative evaluations. For the zero-shot classification on ImageNet, IG captioner achieves $> 18\%$ improvements over the standard captioner, achieving comparable performances with the CLIP classifier. IG captioner also demonstrated strong performance on zero-shot image-text retrieval tasks on MSCOCO and Flickr30K. We hope this paper inspires further research towards unifying generative and discriminative training procedures for visual-language models.
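
The effect of the redesigned scoring objective can be illustrated with a minimal Python sketch. It assumes the per-caption log-likelihoods have already been computed by a captioner and a language model; the prompts, the numbers, and the helper names `ig_scores` and `best_caption` are hypothetical and chosen only to show a case where a frequent caption wins under plain likelihood but loses once the text prior is subtracted.

```python
def ig_scores(logp_t_given_i, logp_t):
    """Information Gain per caption: log P(T|I) - log P(T)."""
    return {t: logp_t_given_i[t] - logp_t[t] for t in logp_t_given_i}


def best_caption(scores):
    """Return the caption with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical log-likelihoods for one image (illustrative values only).
# "dog" is a very common caption in text, "tench" is rare.
logp_t_given_i = {
    "a photo of a dog": -6.0,
    "a photo of a tench": -7.0,
    "a photo of a kite": -9.0,
}
logp_t = {
    "a photo of a dog": -5.0,
    "a photo of a tench": -12.0,
    "a photo of a kite": -8.0,
}

# Standard captioner: rank by log P(T|I) alone.
captioner_pred = best_caption(logp_t_given_i)

# IG evaluation: rank by log P(T|I) - log P(T).
ig_pred = best_caption(ig_scores(logp_t_given_i, logp_t))

print("captioner:", captioner_pred)
print("IG eval:  ", ig_pred)
```

In this toy case the image raises the likelihood of the rare caption far more than that of the common one, so the IG ranking selects the rare caption even though its raw conditional likelihood is lower.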

Paper Structure (36 sections, 6 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: The prediction of a captioner-based classifier is influenced by the linguistic priors in pure text modality. The Information Gain (IG) evaluation reduces such impact and makes the predictions more grounded on the visual inputs. We illustrate with real predictions on zero-shot ImageNet (Deng et al., 2009) classification in this figure. (a) The language model was trained on Laion-5B (Schuhmann et al., 2022) captions only. (b) The captioner was trained on the Laion-5B dataset with both the images and captions. The IG evaluation in (b) uses the outputs of both the captioner and the language model in (a).
  • Figure 2: The inference pipeline of IG captioner. $I$ and $T$ represent the input image and caption. IG captioner consists of an image encoder and a text decoder. The text decoder is able to provide both the multimodal and unimodal predictions. The unimodal predictions can be cached and reused across different input images. Both the image and object classes are from the ImageNet dataset.
  • Figure 3: Correlation between the green line ($- \log P(T|I)$) and the orange line ($- \log P(T)$) on zero-shot ImageNet classification. $\log P(T|I)$ is predicted by a multimodal captioner trained on the Laion-5B dataset. $\log P(T)$ is predicted by a unimodal language model trained on the Laion-5B captions only. Due to limited space, 100 randomly sampled ImageNet classes are shown. The numerical correlation measurements between $\log P(T|I)$ and $\log P(T)$ for all 1000 classes are shown in Tab. \ref{tab: pcc_pti_pt}.
  • Figure 4: The training pipeline of IG captioner. IG captioner has two modes. Without any given images, IG captioner is a unimodal language model. When given an input image, IG captioner becomes the multimodal captioner that models the conditional probability of its caption. The image and caption are from the Laion-5B dataset.
  • Figure 5: Ablations of the Information Gain (IG) evaluation on zero-shot ImageNet classification. All the models are trained on the Laion-5B dataset. Captioner + IG eval uses the evaluation objective $\log P(T|I) - \alpha \log P(T|\mathbf{0})$, where $\log P(T|\mathbf{0})$ is the prediction of the captioner when the input is a zero-intensity image. It is used to approximate $\log P(T)$, because the standard captioner after training is not able to directly predict $\log P(T)$.
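
The captions of Figures 2 and 5 describe two practical details: the unimodal scores do not depend on the image and can be cached, and the ablation weights the unimodal term with $\alpha$. A minimal sketch of both ideas follows. The function name `ig_classify`, the image names, and all numeric values are hypothetical; the cached list stands in for either $\log P(T)$ or its approximation $\log P(T|\mathbf{0})$.

```python
def ig_classify(multimodal_logps, cached_unimodal_logps, alpha=1.0):
    """Return the index of the class maximizing log P(T|I) - alpha * log P(T)."""
    scores = [
        m - alpha * u
        for m, u in zip(multimodal_logps, cached_unimodal_logps)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Unimodal scores, one per class prompt. They are image-independent, so they
# are computed once and reused for every image (illustrative values only).
cached_unimodal = [-5.0, -12.0, -8.0]

# Hypothetical multimodal scores log P(T|I) for two images.
multimodal = {
    "img_a": [-6.0, -7.0, -9.0],
    "img_b": [-4.0, -13.0, -9.5],
}

# Sweep alpha: 0.0 recovers the standard captioner ranking,
# 1.0 gives the unweighted IG score.
preds = {
    alpha: {
        name: ig_classify(logps, cached_unimodal, alpha)
        for name, logps in multimodal.items()
    }
    for alpha in (0.0, 0.5, 1.0)
}

for alpha, by_image in preds.items():
    print(alpha, by_image)
```

With these toy numbers the prediction for `img_a` moves from class 0 to class 1 once the prior is discounted, while `img_b` stays at class 0 for every value of $\alpha$, which shows that subtracting the prior changes a prediction only when the image evidence supports it.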