CountCLIP -- [Re] Teaching CLIP to Count to Ten
Harshvardhan Mestha, Tejas Agrawal, Karan Bania, Shreyas V, Yash Bhisikar
TL;DR
This work investigates counting in Vision-Language Models by reproducing and extending the Counting-CLIP approach, finetuning CLIP with a counting loss to align image representations with count captions while preserving zero-shot capabilities. It introduces class-balanced lambda schemes and a CountPlus variant that contrasts against all incorrect counts, evaluating on the CountBench benchmark. Despite using a dramatically smaller counting dataset (∼2,000 examples) and constrained compute, the study shows performance gains over the baseline and provides public counting data and code, while highlighting issues in CountBench such as missing images. The results underscore the potential for reproducible, count-aware VLMs and emphasize the need for diverse counting data and robust benchmarks to better capture higher-count scenarios.
Abstract
Large vision-language models (VLMs) learn rich joint image-text representations that yield strong performance on relevant downstream tasks. However, they show little quantitative understanding of objects and lack counting-aware representations. This paper presents a reproducibility study of 'Teaching CLIP to Count to Ten' (Paiss et al., 2023), which finetunes a CLIP model (Radford et al., 2021) with an added counting-contrastive loss term to improve zero-shot counting accuracy while maintaining zero-shot classification performance. We reproduce the study with our own code to verify these claims, and we improve the model's performance using a smaller subset of their training data and fewer computational resources. The implementation can be found at https://github.com/SforAiDl/CountCLIP.
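To make the counting-contrastive loss term concrete, the following is a minimal sketch in plain Python, not the code from the repository. It assumes the loss is a softmax cross-entropy over the image's similarity to the true-count caption versus one or more wrong-count captions, and that it is added to the usual CLIP loss with a scalar weight. The function names, the `lam` parameter, and its default value are hypothetical. Passing one wrong-count caption corresponds to a single counterfactual; passing several approximates the idea of contrasting against all incorrect counts.

```python
import math


def dot(u, v):
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def counting_loss(image_emb, true_caption_emb, wrong_caption_embs):
    """Negative log-probability of the true-count caption among all candidates.

    wrong_caption_embs holds the embeddings of captions whose count was
    swapped for an incorrect number (one or many).
    """
    pos = dot(image_emb, true_caption_emb)
    negs = [dot(image_emb, w) for w in wrong_caption_embs]
    scores = [pos] + negs
    # Numerically stable log-sum-exp over all candidate captions.
    m = max(scores)
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denom - pos


def total_loss(clip_loss, count_loss, lam=1.0):
    """Combine the standard CLIP loss with the weighted counting loss.

    lam is an assumed scalar weight; the default here is illustrative only.
    """
    return clip_loss + lam * count_loss
```

The loss shrinks as the image embedding aligns more closely with the true-count caption than with the wrong-count captions, and adding more wrong-count captions gives the model more alternatives to rule out.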
