Generalized Semantic Contrastive Learning via Embedding Side Information for Few-Shot Object Detection

Ruoyu Chen, Hua Zhang, Jingzhi Li, Li Liu, Zhen Huang, Xiaochun Cao

TL;DR

This work addresses FSOD by introducing embedding side information to build a knowledge matrix that encodes semantic relations between base and novel categories. It integrates a Contextual Semantic Supervised Contrastive Learning (CCL) branch, a memory prototype bank, and a side-information guided counterfactual data augmentation to reduce feature-space bias and overfitting. The approach yields consistent improvements across multiple benchmarks (PASCAL VOC, MS COCO, LVIS V1, FSOD-1K, FSVOD-500) and backbones (ResNet and ViT), achieving state-of-the-art results in many settings. The combination of semantic-aware contrastive learning and interpretable augmentation offers a practical path to more robust FSOD in diverse, data-scarce scenarios.

Abstract

The objective of few-shot object detection (FSOD) is to detect novel objects from only a few training samples. The core challenge is to construct, on top of the base-category space, a generalized feature space for novel categories with limited data, so that the learned detector can adapt to unknown scenarios. Because novel-category samples are scarce, two issues remain: (1) the features of a novel category are easily represented implicitly by the features of base categories, leading to inseparable classifier boundaries; (2) the few samples of a novel category cannot fully represent its distribution, so fine-tuning is prone to overfitting. To address these issues, we introduce side information to alleviate the negative influences arising from both the feature-space and sample viewpoints, and formulate a novel generalized feature representation learning method for FSOD. Specifically, we first use embedding side information to construct a knowledge matrix that quantifies the semantic relationship between the base and novel categories. Then, to strengthen the discrimination between semantically similar categories, we develop contextual semantic supervised contrastive learning, which embeds side information. Furthermore, to prevent overfitting caused by sparse samples, we introduce a side-information guided region-aware masked module to increase sample diversity; it uses counterfactual explanation to find and discard biased information that discriminates between similar categories, and further refines the discriminative representation space. Extensive experiments using ResNet and ViT backbones on the PASCAL VOC, MS COCO, LVIS V1, FSOD-1K, and FSVOD-500 benchmarks demonstrate that our model outperforms previous state-of-the-art methods, significantly improving FSOD performance in most shots/splits.
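The knowledge matrix described in the abstract can be pictured with a small sketch. The abstract does not say how the matrix is computed from the embedding side information, so the cosine similarity used here, the function name `knowledge_matrix`, and the toy three-dimensional embeddings are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def knowledge_matrix(novel_emb: np.ndarray, base_emb: np.ndarray) -> np.ndarray:
    """Return a (num_novel, num_base) matrix of cosine similarities.

    Assumption: semantic relatedness is measured by cosine similarity
    between per-category side-information embeddings.
    """
    novel = novel_emb / np.linalg.norm(novel_emb, axis=1, keepdims=True)
    base = base_emb / np.linalg.norm(base_emb, axis=1, keepdims=True)
    return novel @ base.T


# Hypothetical toy embeddings: two base categories and one novel category.
base_emb = np.array([[1.0, 0.0, 0.0],   # base category 0
                     [0.0, 1.0, 0.0]])  # base category 1
novel_emb = np.array([[0.9, 0.1, 0.0]])  # novel category, close to base 0

K = knowledge_matrix(novel_emb, base_emb)
# Index of the most semantically similar base category per novel category.
most_similar_base = K.argmax(axis=1)
```

Under this reading, a row of `K` ranks the base categories that a novel category is most likely to be confused with, which is the information the later contrastive and augmentation steps are said to rely on.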

Paper Structure

This paper contains 44 sections, 4 theorems, 35 equations, 15 figures, 13 tables.

Key Result

Lemma 1

For any sample $\mathbf{x}$, assume its norm is bounded by $B$, i.e., $\|\mathbf{x}\|\le B$. Let $\mathbf{X}^{N_b} = \{\mathbf{x}_i^{b}\}_{i=1}^{N_b}$ and $\mathbf{X}^{N_n} = \{\mathbf{x}_i^{n}\}_{i=1}^{N_n}$ be two sets of i.i.d. samples from the base set $\mathcal{D}_{\mathrm{base}}$ and the novel set [...] where $\gamma_{\mathcal{F}}(\cdot, \cdot)$ is the integral probability metric (muller1997integral) [...]
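For readers unfamiliar with the term, the standard definition of an integral probability metric over a function class $\mathcal{F}$ is given below. The symbols $P$ and $Q$ stand for two generic distributions and are introduced here only for illustration; whether the lemma uses exactly this form and which function class it assumes cannot be read from the truncated statement.

$$
\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{\mathbf{x} \sim P}\left[f(\mathbf{x})\right] - \mathbb{E}_{\mathbf{x} \sim Q}\left[f(\mathbf{x})\right] \right|
$$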

Figures (15)

  • Figure 1: A. Motivation of our method. Traditional FSOD methods only consider the feature representation between distinct categories, which makes the detection model sensitive to the training data distribution. Our model measures the differences between categories using visual attributes (e.g., head, eye), which allows it to learn generalizable and discriminative feature representations. B. In the fine-tuning stage, the novel category may implicitly use the features of multiple base categories for representation, leading to a scattered feature space. Based on the knowledge matrix, contextual semantic supervised contrastive learning is developed to strengthen the space discrimination between semantically similar categories. C. Due to the scarcity of few-shot data, the distribution of the novel category cannot be fully represented, resulting in data bias and overfitting. We use the counterfactual explanation method and the masking mechanism to augment the few-shot data so that the mined sample features lie closer to the decision boundary; these samples are trained jointly to improve the generalization of the model.
  • Figure 2: An overview of our few-shot object detection fine-tuning method. We first measure the similarity between the base categories and the novel categories using visual attributes, and represent it by constructing a knowledge matrix. During fine-tuning, the memory prototype bank continuously stores the complete features of all categories. The proposed model leverages the Contextual Semantic Supervised Contrastive Learning (CCL) module and the knowledge matrix to learn generalized representations and improve discriminability. Specifically, the CCL module strengthens the distinction between proposal features and specific prototype categories, while the knowledge matrix enables the model to incorporate semantic relations between categories into its representations. Each input image is augmented with a certain probability. Semantically similar counter categories are selected for the current category via the knowledge matrix, and saliency maps of region images are computed via counterfactual explanation. The original image is then erased according to the saliency map, subject to a threshold. The erased images aid in training the detector to improve its generalization capability and reduce the learning bias. The memory prototype bank is not updated with erased features.
  • Figure 3: Knowledge matrix for different cases. Relationships between categories were built through semantic information from different side information.
  • Figure 4: t-SNE visualization of the object proposals learned by TFA, FSCE, and our method. We randomly select 100 proposals from each category. The novel categories are presented in italic font. Please zoom in for better visualization.
  • Figure 5: A. Comparison with different data augmentation strategies. Counterfactual data augmentation guided by a knowledge matrix achieves the best performance. B. Probability of data augmentation $\varepsilon$; the best performance is achieved at $\varepsilon = 5\%$. C. Erasing threshold $t$; the best performance is achieved at $t = 0.8$.
  • ...and 10 more figures
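The erasing step described in the Figure 2 and Figure 5 captions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a saliency map is already available (the counterfactual explanation step that would produce it is not shown), that pixels whose normalized saliency reaches the threshold $t$ are the ones erased, and that erased pixels are filled with zeros. The function names `erase_salient` and `maybe_augment` are hypothetical; the defaults `t=0.8` and `eps=0.05` follow the best values reported in the Figure 5 caption.

```python
import numpy as np


def erase_salient(region: np.ndarray, saliency: np.ndarray,
                  t: float = 0.8, fill: float = 0.0):
    """Erase pixels of an (H, W, C) region whose saliency is at least t.

    Assumption: the saliency map is min-max normalized to [0, 1] first and
    the most salient (most class-discriminative) pixels are discarded.
    """
    s = saliency - saliency.min()
    peak = s.max()
    if peak > 0:
        s = s / peak
    mask = s >= t                 # boolean (H, W) mask of pixels to erase
    erased = region.copy()        # leave the original region untouched
    erased[mask] = fill           # broadcasts over the channel axis
    return erased, mask


def maybe_augment(region: np.ndarray, saliency: np.ndarray,
                  rng: np.random.Generator,
                  eps: float = 0.05, t: float = 0.8) -> np.ndarray:
    """Apply the erasing augmentation with probability eps."""
    if rng.random() < eps:
        return erase_salient(region, saliency, t)[0]
    return region


# Hypothetical toy input: a 2x2 RGB region and a matching saliency map.
region = np.ones((2, 2, 3))
saliency = np.array([[0.0, 0.5],
                     [0.9, 1.0]])
erased, mask = erase_salient(region, saliency, t=0.8)
```

In a training loop, `maybe_augment` would be called per sampled region with a shared `np.random.default_rng(...)` generator; per the Figure 2 caption, features of erased images would not be written to the memory prototype bank.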

Theorems & Definitions (8)

  • Definition 3.1: Contextual semantic supervised contrastive learning with side information
  • Definition 3.2: Knowledge Matrix
  • Lemma 1: yang2021bridging
  • Theorem A.1
  • proof
  • Proposition A.1
  • Theorem A.2
  • proof
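Definition 3.1 is listed by title only, so the exact loss is not recoverable from this page. As a rough illustration of the idea named in the Figure 2 caption (contrasting proposal features against category prototypes while taking semantic relations into account), the sketch below uses a prototype-based InfoNCE-style loss in which negatives that are semantically similar to the true category receive a larger weight. The weighting scheme `1 + similarity`, the temperature `tau`, and the function name are assumptions, not the paper's definition.

```python
import numpy as np


def weighted_prototype_contrastive_loss(z: np.ndarray, prototypes: np.ndarray,
                                        label: int, similarity_row: np.ndarray,
                                        tau: float = 0.2) -> float:
    """InfoNCE-style loss of one proposal feature against class prototypes.

    Assumption: a negative prototype c is weighted by 1 + max(sim(label, c), 0),
    so semantically similar categories are pushed apart more strongly.
    """
    z = z / np.linalg.norm(z)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = p @ z / tau                              # (num_classes,)
    weights = 1.0 + np.clip(similarity_row, 0.0, None)
    weights[label] = 1.0                              # positive term unweighted
    shifted = logits - logits.max()                   # numerical stability
    log_denominator = np.log(np.sum(weights * np.exp(shifted)))
    return float(log_denominator - shifted[label])


# Hypothetical toy setup: three orthogonal prototypes, one proposal of class 0.
prototypes = np.eye(3)
z = np.array([1.0, 0.2, 0.0])
plain_loss = weighted_prototype_contrastive_loss(z, prototypes, 0, np.zeros(3))
weighted_loss = weighted_prototype_contrastive_loss(
    z, prototypes, 0, np.array([1.0, 0.8, 0.0]))  # class 1 is similar to class 0
```

With all similarities set to zero the loss reduces to ordinary softmax cross-entropy over prototypes; a positive similarity for a negative class increases the loss for the same feature, which is the intended extra pressure on semantically close categories.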