Re-Scoring Using Image-Language Similarity for Few-Shot Object Detection

Min Jae Jung; Seung Dae Han; Joohee Kim

TL;DR

RISF addresses the challenge of detecting novel objects with very few labels by fusing image-language understanding with a loss-aware fine-tuning strategy. It introduces CM-CLIP, which re-scores detector outputs via CLIP-derived image-class similarities, and BNRL, a loss that mitigates missing annotations and hard negatives during training. The combination yields substantial improvements over prior FSOD and gFSOD methods on MS-COCO and Pascal VOC, validating the benefit of integrating vision-language priors into transfer-learning-based FSOD. While CM-CLIP incurs higher inference cost, the gains in novel-class detection suggest practical value for real-world few-shot recognition tasks, with potential refinements to reduce latency and further bolster base-class stability.

Abstract

Few-shot object detection, which focuses on detecting novel objects with few labels, is an emerging challenge in the community. Recent studies show that adapting a pre-trained model or modifying the loss function can improve performance. In this paper, we explore leveraging the power of Contrastive Language-Image Pre-training (CLIP) and a hard negative classification loss in the low-data setting. Specifically, we propose Re-scoring using Image-language Similarity for Few-shot object detection (RISF), which extends Faster R-CNN by introducing a Calibration Module using CLIP (CM-CLIP) and a Background Negative Re-scale Loss (BNRL). The former adapts CLIP, which performs zero-shot classification, to re-score the classification scores of a detector using image-class similarities; the latter is a modified classification loss that accounts for the penalty on fake backgrounds as well as confusing categories on a generalized few-shot object detection dataset. Extensive experiments on MS-COCO and PASCAL VOC show that the proposed RISF substantially outperforms the state-of-the-art approaches. The code will be available.
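
As a rough illustration of the re-scoring idea described in the abstract, the sketch below blends a detector's class scores for one predicted box with a softmax over image-text cosine similarities for the corresponding crop. The blend rule (a weighted geometric mean with weight `beta`), the `temperature` value, and all function names are assumptions made for illustration; the abstract does not state how RISF combines the two scores, and the embeddings here are toy vectors rather than real CLIP outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rescore(det_scores, crop_emb, text_embs, beta=0.5, temperature=0.01):
    """Blend detector scores with image-class similarities for one box.

    det_scores: detector class probabilities for the box.
    crop_emb:   embedding of the cropped box (stand-in for an image embedding).
    text_embs:  one embedding per class prompt (stand-in for text embeddings).
    beta:       hypothetical blend weight, not a value from the paper.
    """
    sims = [cosine(crop_emb, t) for t in text_embs]
    sim_probs = softmax([s / temperature for s in sims])
    # Weighted geometric mean of the two score vectors, then renormalize.
    fused = [(d ** (1 - beta)) * (c ** beta) for d, c in zip(det_scores, sim_probs)]
    total = sum(fused)
    return [f / total for f in fused]

# Toy case: the detector slightly prefers class 0, but the crop embedding
# is much closer to the text embedding of class 1.
det_scores = [0.55, 0.45]
crop_emb = [0.1, 0.9]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
fused = rescore(det_scores, crop_emb, text_embs)
```

In this toy case the similarity term overrides the detector's weak preference, which is the kind of correction a calibration module is meant to provide; with real models the weight given to each source would need to be tuned.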

Paper Structure (20 sections, 8 equations, 11 figures, 7 tables, 1 algorithm)

Introduction
Related Works
Few-shot learning
Few-Shot Object Detection
Joint Vision and Language Modeling
Methods
Problem definition
CM-CLIP
BNRL
Experiments
Experimental Setup
Comparison with State-of-the-Art
Ablation studies
Effectiveness of CM-CLIP and BNRL
Analysis of the CM-CLIP
...and 5 more sections

Figures (11)

Figure 1: CM-CLIP process scenario. Top: the few-shot object detector predicts bounding boxes and classification scores for objects in the input image. Bottom: using a pretrained CLIP model, CM-CLIP re-infers on the cropped images obtained from the predicted bounding boxes. The re-inferred output is then integrated with the detector's inference outcome to yield the final score.
Figure 2: Missing annotations in the FSOD-protocol dataset. (a) shows an empirical example of missing annotations on MS-COCO under the FSOD protocol, indicated by the red-dotted boxes. In (b), we present the number of missing annotations in the MS-COCO 10-shot training dataset under the FSOD protocol.
Figure 3: Clean annotations in the FSOD-protocol dataset. (a) shows an empirical example of a specific seed with carefully selected annotations on MS-COCO under the FSOD protocol. In (b), the model trained on random seeds, which contain missing annotations, exhibits lower AP than the model trained on the clean seed, which is built from carefully selected images and has very few missing annotations. Error bars report the average and the 95% confidence interval.
Figure 4: Two classification score distributions. When the ground truth is "dog," the cross-entropy or focal loss of the two classification score distributions (orange and blue) is the same. However, BNRL assigns a larger loss to the orange distribution than to the blue one. $\alpha$ is the scaling factor of the noise.
Figure 5: Graph of the mirror term, $- \sum \log(1-p(c))$, when $p_\alpha \propto \frac{1}{N} + \alpha x_{c}$. The mirror term is a monotonically increasing function of $\alpha$.
...and 6 more figures
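
The captions of Figures 4 and 5 can be made concrete with a small numeric sketch. It compares two score distributions that give the ground-truth class the same probability, so their cross-entropy is identical, and then evaluates the mirror term $-\sum \log(1-p(c))$. Summing only over the non-ground-truth classes, the class names, and the example probabilities are assumptions for illustration; this is not the full BNRL, whose exact form is not given in the text above.

```python
import math

def cross_entropy(probs, gt_index):
    """Standard cross-entropy for a single sample: -log p(ground truth)."""
    return -math.log(probs[gt_index])

def mirror_term(probs, gt_index):
    """-sum over non-ground-truth classes of log(1 - p(c)).

    Restricting the sum to non-ground-truth classes is an assumption
    made for this sketch.
    """
    return -sum(math.log(1.0 - p) for i, p in enumerate(probs) if i != gt_index)

# Hypothetical classes: [dog, cat, background]; ground truth is "dog".
gt = 0
peaked = [0.6, 0.38, 0.02]  # leftover mass concentrated on one confusing class
spread = [0.6, 0.20, 0.20]  # leftover mass spread evenly

ce_peaked = cross_entropy(peaked, gt)
ce_spread = cross_entropy(spread, gt)
mirror_peaked = mirror_term(peaked, gt)
mirror_spread = mirror_term(spread, gt)
```

Both distributions have the same cross-entropy because it depends only on the ground-truth probability, while the mirror term is larger for the distribution that concentrates its remaining mass on a single wrong class. This follows from the convexity of $-\log(1-p)$.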
