CountCLIP -- [Re] Teaching CLIP to Count to Ten

Harshvardhan Mestha, Tejas Agrawal, Karan Bania, Shreyas V, Yash Bhisikar

TL;DR

This work investigates counting in Vision-Language Models by reproducing and extending the Counting-CLIP approach, finetuning CLIP with a counting loss to align image representations with count captions while preserving zero-shot capabilities. It introduces class-balanced lambda schemes and a CountPlus variant that contrasts against all incorrect counts, evaluating on the CountBench benchmark. Despite using a dramatically smaller counting dataset (∼2,000 examples) and constrained compute, the study shows performance gains over the baseline and provides public counting data and code, while highlighting issues in CountBench such as missing images. The results underscore the potential for reproducible, count-aware VLMs and emphasize the need for diverse counting data and robust benchmarks to better capture higher-count scenarios.

Abstract

Large vision-language models (VLMs) have been shown to learn rich joint image-text representations that enable strong performance on relevant downstream tasks. However, they show limited quantitative understanding of objects and lack counting-aware representations. This paper conducts a reproducibility study of 'Teaching CLIP to Count to Ten' (Paiss et al., 2023), which presents a method to finetune a CLIP model (Radford et al., 2021) with an additional counting-contrastive loss term, improving zero-shot counting accuracy while maintaining zero-shot classification performance. We improve the model's performance using a smaller subset of their training data and fewer computational resources. We verify these claims by reproducing their study with our own code. The implementation can be found at https://github.com/SforAiDl/CountCLIP.
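
To make the counting-contrastive term concrete, the following is a minimal sketch rather than the authors' implementation. It assumes unit-normalized embeddings, omits the temperature, and uses hypothetical names (`count_loss`, `total_loss`, `lam`). Passing a single wrong-count caption mirrors a one-counterfactual contrast; passing every incorrect count corresponds to the "contrast against all incorrect counts" idea described for CountPlus in the TL;DR.

```python
import math


def dot(u, v):
    """Inner product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def count_loss(image_emb, true_text_emb, wrong_text_embs):
    """Contrast the correct count caption against wrong-count captions.

    Computes -log softmax of the correct caption's similarity over the
    set {correct caption} + wrong captions. Assumption: embeddings are
    already normalized and no temperature scaling is applied.
    """
    pos = dot(image_emb, true_text_emb)
    logits = [pos] + [dot(image_emb, w) for w in wrong_text_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos


def total_loss(clip_loss, counting_loss, lam=1.0):
    """Combine the usual CLIP loss with the weighted counting term.

    `lam` stands in for the lambda weight; the class-balanced schemes
    mentioned in the paper are not reproduced here.
    """
    return clip_loss + lam * counting_loss
```

In this sketch the loss shrinks as the image embedding moves closer to the correct count caption than to the wrong ones, and adding more wrong-count captions can only increase it for a fixed image.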

Paper Structure (11 sections, 7 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Figure copyright Paiss et al. (2023). (a) Training setup (b) Examples of counting images (c) Examples of noncounting images
  • Figure 2: (a) Class frequency (b) $\log_{2}(\text{frequency})$ (c) $\log_{2}(\log_{2}(\text{frequency}))$
  • Figure 3: Confusion matrices with early stopping for: (a) baseline (b) scheduler and $\lambda = 1$ (base model) (c) scheduler and $\lambda_{modal}$ and $L_{count+}$ (d) scheduler and $\lambda_{norm}$ and $L_{count+}$ (e) scheduler and $\lambda_{log}$ and $L_{count+}$
  • Figure 4: Confusion matrices for models trained until the end of the $10^{\text{th}}$ epoch for: (a) baseline (b) $\lambda = 1$ with no scheduler (c) scheduler and $\lambda = 1$ (base model) (d) $\lambda_{norm}$ (e) scheduler and $\lambda_{norm}$ (f) scheduler and $\lambda_{norm}$ and $L_{count+}$ (g) $\lambda_{modal}$ (h) scheduler and $\lambda_{modal}$ (i) scheduler and $\lambda_{modal}$ and $L_{count+}$ (j) $\lambda_{log}$ (k) scheduler and $\lambda_{log}$ (l) scheduler and $\lambda_{log}$ and $L_{count+}$
  • Figure 5: Confusion matrices for an early stopping mechanism selecting models with maximum validation accuracy for: (a) baseline (b) $\lambda = 1$ with no scheduler (c) scheduler and $\lambda = 1$ (base model) (d) $\lambda_{norm}$ (e) scheduler and $\lambda_{norm}$ (f) scheduler and $\lambda_{norm}$ and $L_{count+}$ (g) $\lambda_{modal}$ (h) scheduler and $\lambda_{modal}$ (i) scheduler and $\lambda_{modal}$ and $L_{count+}$ (j) $\lambda_{log}$ (k) scheduler and $\lambda_{log}$ (l) scheduler and $\lambda_{log}$ and $L_{count+}$