Improving Contrastive Learning for Referring Expression Counting

Kostas Triaridis, Panagiotis Kaliosis, E-Ro Nguyen, Jingyi Xu, Hieu Le, Dimitris Samaras

TL;DR

This work tackles Referring Expression Counting by introducing C-REX, an image-space supervised contrastive learning framework that aligns object embeddings sharing the same class and referring expression while separating those with different attributes, all within a robust centroid-based detection baseline. By leveraging a large pool of negatives and a modified positive-anchor contrastive loss, C-REX achieves state-of-the-art REC results and strong performance in class-agnostic counting, outperforming prior image-text and density-based approaches. The method gains from re-purposing open-set detectors into centroid predictors and demonstrates broad applicability to related counting tasks, with ablations validating the effectiveness of the loss design and positive-selection strategy. Overall, C-REX advances fine-grained visual counting by combining simple, interpretable detection with targeted image-space contrastive learning, offering practical improvements for counting under complex contextual and attribute-based distinctions.

Abstract

Object counting has progressed from class-specific models, which count only known categories, to class-agnostic models that generalize to unseen categories. The next challenge is Referring Expression Counting (REC), where the goal is to count objects based on fine-grained attributes and contextual differences. Existing methods struggle to distinguish visually similar objects that belong to the same category but correspond to different referring expressions. To address this, we propose C-REX, a novel framework based on supervised contrastive learning and designed to enhance discriminative representation learning. Unlike prior works, C-REX operates entirely within the image space, avoiding the misalignment issues of image-text contrastive learning and thus providing a more stable contrastive signal. It also guarantees a significantly larger pool of negative samples, leading to improved robustness in the learned representations. Moreover, we show that our framework is versatile and generic enough to be applied to similar tasks such as class-agnostic counting. To support our approach, we analyze the key components of state-of-the-art detection-based models and identify that detecting object centroids instead of bounding boxes is the key common factor behind their success in counting tasks. We use this insight to design a simple yet effective detection-based baseline to build upon. Our experiments show that C-REX achieves state-of-the-art results in REC, outperforming previous methods by more than 22% in MAE and more than 10% in RMSE, while also demonstrating strong performance in class-agnostic counting. Code is available at https://github.com/cvlab-stonybrook/c-rex.
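
To make the idea of image-space supervised contrastive learning concrete, the sketch below computes the standard supervised contrastive (SupCon) loss over a batch of object embeddings. It is an illustration only, not the C-REX loss: the abstract does not specify the authors' modified formulation, so the function name `supcon_loss`, the temperature value, and the use of `(class, expression)` tuples as labels are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss (illustrative, not the C-REX loss).

    embeddings: list of equal-length, non-zero float vectors, one per object.
    labels: one hashable label per object; here a hypothetical
            (class, referring expression) tuple. Objects with equal labels
            are positives for each other; all other objects are negatives.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over positives and negatives.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


# Two objects match the expression, two belong to the same class but do not.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = [("car", "left"), ("car", "left"), ("car", "right"), ("car", "right")]
shuffled = [("car", "left"), ("car", "right"), ("car", "left"), ("car", "right")]
print(supcon_loss(embeddings, aligned) < supcon_loss(embeddings, shuffled))
```

The loss is small when embeddings with the same label already cluster together and large when same-label embeddings are far apart, which is the signal that pulls matching instances together and pushes same-class, different-expression instances apart.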

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Our proposed method C-REX. C-REX aligns embeddings of objects sharing the same class and referring expression while separating those with different expressions or classes.
  • Figure 2: Qualitative comparison of referring expression counting (REC) results across different methods. The first row shows the input images, while the second row contains ground truth annotations. The third, fourth, and fifth rows display predictions from GroundingREC, our improved Grounding DINO baseline, and C-REX, respectively. Each column corresponds to a different referring expression. We observe that our method not only gets the most accurate counts, but it also counts the correct ground truth instances (i.e., the ones that were truly referred to by the given expression).
  • Figure 3: Quantitative comparison between GroundingREC, the improved Grounding DINO baseline, and C-REX on the REC-8K test set by object count range. The number of samples in each bin is annotated below the bin's range as n. We observe that C-REX outperforms the two baseline models across all count ranges, with the results being close only for RMSE in the highest count bin.
  • Figure 4: Quantitative comparison in terms of MAE between GroundingREC, our improved baseline and C-REX in the REC-8K test set for different RE categories. C-REX outperforms both baseline models across most categories, with the largest improvements shown in the action, orientation and location categories. We visualize categories with more than $30$ samples.
  • Figure 5: Some qualitative examples from the "location" and "orientation" attribute categories, for which our model vastly outperforms previous works. Here we can observe that our novel contrastive learning approach allows the model to disambiguate between fine-grained spatial attributes, only selecting instances from items in the "top" layer for the first image and cars driving to the "left" in the second image.
  • ...and 1 more figures
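
The MAE and RMSE figures quoted in the abstract, and the per-count-range breakdown described for Figure 3, follow the usual definitions of these counting metrics. The sketch below shows one way to compute them overall and per bin of ground-truth count. The function names, the example counts, and the bin boundaries are hypothetical; the bins actually used in the paper are not given in this summary.

```python
import math


def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth counts."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def metrics_by_count_range(pred, gt, ranges):
    """Group samples by ground-truth count and report n, MAE and RMSE per bin.

    ranges: list of (low, high) pairs, both ends inclusive (hypothetical bins).
    Bins that contain no samples are omitted from the result.
    """
    report = {}
    for low, high in ranges:
        idx = [i for i, g in enumerate(gt) if low <= g <= high]
        if not idx:
            continue
        bin_pred = [pred[i] for i in idx]
        bin_gt = [gt[i] for i in idx]
        report[(low, high)] = {
            "n": len(idx),
            "mae": mae(bin_pred, bin_gt),
            "rmse": rmse(bin_pred, bin_gt),
        }
    return report


# Made-up predicted and ground-truth counts for four image/expression pairs.
pred = [3, 5, 12, 20]
gt = [2, 5, 10, 24]
print(mae(pred, gt), rmse(pred, gt))
print(metrics_by_count_range(pred, gt, [(0, 5), (6, 30)]))
```

Because RMSE squares each error before averaging, a few large misses in high-count images affect it far more than they affect MAE, which is why per-bin reporting of both metrics is informative.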