Teaching CLIP to Count to Ten

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel

TL;DR

The paper identifies counting as a fundamental but underdeveloped capability in vision-language models like CLIP. It introduces a counting-focused finetuning framework that combines the original CLIP objective with a counting-contrastive loss trained on a carefully filtered counting dataset, leveraging counterfactual captions to create hard negatives. A new CountBench dataset is proposed to evaluate counting ability, and the authors demonstrate significant improvements in zero-shot counting and downstream tasks such as image retrieval and text-conditioned generation, while preserving performance on standard benchmarks. The work highlights practical benefits for precise content retrieval and generation and discusses societal implications and future extensions to other compositional concepts.

Abstract

Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent, well-documented limitation: they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant, which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench", a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.
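
The counterfactual captions and the counting-contrastive loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the two-to-ten range of number words, and the two-way softmax form of the loss are all assumptions made for illustration, and plain Python lists stand in for CLIP embeddings. In finetuning, a loss of this kind would be added to the original CLIP objective, possibly with a weighting factor that the text above does not specify.

```python
import math
import random
import re

# Spelled-out counts considered here; the range two..ten is an assumption.
NUMBER_WORDS = ["two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten"]
_NUMBER_RE = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def make_counterfactual(caption, rng):
    """Swap the first spelled-out count in `caption` for a different one.

    Returns None when the caption contains no count word.
    """
    match = _NUMBER_RE.search(caption)
    if match is None:
        return None
    original = match.group(1)
    choices = [w for w in NUMBER_WORDS if w != original.lower()]
    replacement = rng.choice(choices)
    if original[0].isupper():
        # Preserve sentence-initial capitalization ("Three" -> "Six").
        replacement = replacement.capitalize()
    return caption[:match.start()] + replacement + caption[match.end():]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def counting_loss(image_embs, caption_embs, counterfactual_embs):
    """Mean cross-entropy of picking the true caption over its counterfactual.

    For each image, the loss is -log(exp(s_true) / (exp(s_true) + exp(s_cf))),
    where s_true and s_cf are image-caption similarity scores.
    """
    total = 0.0
    for img, cap, cf in zip(image_embs, caption_embs, counterfactual_embs):
        s_true = dot(img, cap)
        s_cf = dot(img, cf)
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(s_true, s_cf)
        log_denominator = m + math.log(math.exp(s_true - m) + math.exp(s_cf - m))
        total += log_denominator - s_true
    return total / len(image_embs)
```

When the image embedding is equally similar to both captions, the per-example loss equals log 2, and it shrinks as the true caption scores higher than the counterfactual. This is the discrimination the abstract describes, with the counterfactual caption acting as a hard negative.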

Paper Structure (30 sections, 2 equations, 22 figures, 7 tables)

Figures (22)

  • Figure 1: Counting-aware CLIP. We demonstrate the effectiveness of our improved CLIP by showing: (a) image retrieval using text captions with different types of objects and their counts in the image (images that match the caption are marked with ✓ and images that do not match it are marked with ✗). Our model retrieves images that match the requested number of objects, while the baseline CLIP often retrieves images that depict the wrong number of objects, or images where the number is explicitly written in the image (e.g. "nine hearts": the image contains the number "9", but has 11 hearts). (b) Attention maps demonstrating that our model focuses its attention on all matching object instances in the image, as opposed to the original CLIP.
  • Figure 2: Method overview. (a) We create a text-image counting training set in which each caption expresses the number of objects depicted in the corresponding image. This is done by using an off-the-shelf object detector to automatically identify text-image examples in which the text count matches the number of visible objects in the image (see Sec. \ref{sec:method-data}). (b) We finetune a pre-trained CLIP model using our counting subset (a), through a dedicated contrastive objective $L_{count}$, used in addition to the original (general) text-image contrastive objective ($L_{clip}$). Specifically, given a text-image example from our counting subset, we automatically create a counterfactual prompt by replacing the true object count in the original caption with an incorrect count; $L_{count}$ encourages the model to embed the image close to its original caption embedding (expressing the true object count) and far from its counterfactual count (see Sec. \ref{sec:method-loss}).
  • Figure 3: Examples of image captions where the numbers are NOT related to object counts. These are automatically filtered out by our method. In all of the above examples, the numbers indicated in the caption do not refer to an actual object count. Numbers often specify measures, versions, dates, time, written numbers in the image, or numbers that refer to things not visible in the image.
  • Figure 4: CountBench benchmark. Sample images and their corresponding captions from our new CountBench object counting benchmark. This benchmark was automatically curated (and manually verified) from the publicly-available LAION-400M dataset.
  • Figure 5: Confusion matrices on CountBench. Classification accuracy on our new counting benchmark, CountBench, broken down into confusion matrices for the public CLIP ViT-L/14 (a), and our improved CLIP ViT-L/14 model (b), demonstrating clear quantitative superiority of our model.
  • ...and 17 more figures
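
Two ideas from the figure captions above can be sketched in a few lines: the detector-based filtering of Figures 2(a) and 3, and the confusion matrices of Figure 5. This is a rough illustration under stated assumptions, not the authors' pipeline. The detector output is modelled as a hypothetical dictionary from class label to instance count. The rule that the detected label must appear in the caption is an assumed simplification, as is the two-to-ten range of counts.

```python
import re

# Spelled-out counts mapped to integers; the range two..ten is an assumption.
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
                "seven": 7, "eight": 8, "nine": 9, "ten": 10}


def caption_count(caption):
    """Return the caption's count if it has exactly one count word, else None."""
    words = re.findall(r"[a-z]+", caption.lower())
    found = [NUMBER_WORDS[w] for w in words if w in NUMBER_WORDS]
    return found[0] if len(found) == 1 else None


def keep_example(caption, detections):
    """Keep a caption-image pair only if the detector confirms the stated count.

    `detections` is a hypothetical detector summary: {class label: instances}.
    The pair is kept when some detected class is named in the caption and its
    instance count equals the number stated in the caption.
    """
    n = caption_count(caption)
    if n is None:
        return False
    text = caption.lower()
    return any(label in text and count == n
               for label, count in detections.items())


def confusion_matrix(true_counts, predicted_counts, classes):
    """Rows index the true count, columns the predicted count."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_counts, predicted_counts):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0
```

For example, `keep_example("Three dogs playing in the yard", {"dog": 3})` keeps the pair, while a caption such as "iPhone case, version two" is dropped unless the detector happens to find two instances of an object that the caption names. Captions like those in Figure 3 need stricter handling than this simple rule provides.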