Is CLIP the main roadblock for fine-grained open-world perception?

Lorenzo Bianchi; Fabio Carrara; Nicola Messina; Fabrizio Falchi

TL;DR

This work probes why open-vocabulary object detectors struggle with fine-grained attribute discrimination in open-world perception, focusing on CLIP as the prevalent backbone. It evaluates CLIP against FG-OVD benchmarks, revealing that the primary bottleneck lies in the poor separability of object attributes within the CLIP latent space rather than localization errors. By adding lightweight layers on frozen CLIP encoders and training them with a two-stage scheme on COCO (warm-up) and FG-OVD (fine-tuning), the authors demonstrate that the necessary fine-grained information is indeed present and can be extracted with simple linear transformations, challenging the view that CLIP lacks attribute knowledge. The findings suggest that better-balanced pretraining and more expressive yet efficient matching functions could enable robust fine-grained open-world perception in practical settings like XR, robotics, and autonomous systems.

Abstract

Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This need is pivotal in emerging domains such as extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time, a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Although these models perform well on generic queries, recent studies have highlighted limitations in their fine-grained recognition capabilities in open-vocabulary settings, i.e., in distinguishing subtle object features such as color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find their root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and those of their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. We therefore investigate whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP.

Paper Structure (19 sections, 8 equations, 3 figures, 1 table)

Introduction
Related Work
Image-Text matching
Fine-grained understanding
Method
CLIP fine-grained evaluation
Latent Space Characteristics and Matching Approaches
Baseline (CLIP matching function)
Linear projection layer
Linear projection layer only above text encoder
Linear projection layer only above visual encoder
MLPs layer
Attention layer
Experiments
Dataset and Metrics
...and 4 more sections

Figures (3)

Figure 1: OVD (a) and FG-OVD (b): in the latter, fine-grained details about the categories to detect are given as free-form text in the input vocabulary.
Figure 2: FG-OVD Dataset for CLIP Matching. We leverage the Fine-Grained Open-Vocabulary object Detection (FG-OVD) benchmark suite and training set to investigate our two research questions, Q1 and Q2. For each object, we extract the corresponding bounding-box crop and compute its visual encoding using CLIP. Text embeddings are then generated for the vocabulary entries (positive and negative captions) associated with the object. Finally, we compute the similarity between the image crop and each vocabulary entry and derive the rank of the positive caption. To address Q1, we use cosine similarity as model S, and the entire pipeline is used only at inference time. To address Q2, we choose model S from the solutions described in the "Latent Space Characteristics and Matching Approaches" section and train it on the FG-OVD training set.
Figure 3: CLIP vs. OWL in fine-grained understanding. We evaluate CLIP and OWL, configured as B/16 and L/14, against the Difficulty-based (first row) and Attribute-based (second row) FG-OVD benchmarks. The bar graph shows the Mean Rank of the positive label (lower is better), which represents the average position assigned by the model to the correct label within the overall vocabulary. Vocabulary lengths vary, with 3 for transparency, 8 for pattern, and 11 for other attributes.
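
The ranking step described in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. It assumes the crop and caption embeddings have already been extracted (here they are tiny hand-made vectors, not real CLIP outputs), and every function name below is hypothetical rather than taken from the authors' repository. The projected variant only illustrates what a linear re-projection on top of frozen embeddings might look like; the matrices would have to be learned, and the tie-handling rule is an assumption.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def cosine_scores(image_emb, text_embs):
    """Cosine similarity between one crop embedding (d,) and n captions (n, d)."""
    return l2_normalize(text_embs) @ l2_normalize(image_emb)


def projected_scores(image_emb, text_embs, w_img, w_txt):
    """Cosine similarity after linear re-projections of both modalities.

    w_img and w_txt are (d, k) matrices; in a real setting they would be
    trained while the encoders stay frozen (assumption for illustration).
    """
    return cosine_scores(image_emb @ w_img, text_embs @ w_txt)


def positive_rank(scores, positive_index=0):
    """1-based rank of the positive caption (ties resolved in its favour)."""
    return 1 + int(np.sum(scores > scores[positive_index]))


def mean_rank(ranks):
    """Average rank of the positive caption over all objects (lower is better)."""
    return float(np.mean(ranks))


# Toy example: one crop and a 3-entry vocabulary whose first entry is the positive.
crop = np.array([1.0, 0.0])
vocab = np.array([
    [1.0, 0.1],   # positive caption, close to the crop
    [0.0, 1.0],   # negative caption, orthogonal to the crop
    [1.0, 0.0],   # hard negative that matches the crop exactly
])

baseline_rank = positive_rank(cosine_scores(crop, vocab))

# Identity projections leave the ranking unchanged by construction.
identity = np.eye(2)
projected_rank = positive_rank(projected_scores(crop, vocab, identity, identity))
```

In this toy vocabulary the hard negative outscores the positive caption, so the positive is ranked second; averaging such ranks over all objects gives the Mean Rank reported in Figure 3.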
