Few-shot Object Localization

Yunhan Ren, Bo Li, Chengyang Zhang, Yong Zhang, Baocai Yin

TL;DR

This work defines Few-shot Object Localization (FSOL), a task that localizes objects in query images using only a few labeled support samples, and addresses challenges such as intra-class appearance variation and occlusion. It introduces a high-performance baseline built on a Dual-path Feature Augmentation (DFA) module, a 3D convolution for multi-branch fusion, and a Self Query (SQ) module that incorporates query information while reducing noise. The method demonstrates strong localization and counting performance on FSC-147, ShanghaiTech, and CARPK, with ablations confirming the effectiveness of DFA and SQ. The proposed approach offers a practical framework for object localization under limited data, achieving results competitive with supervised methods on dense datasets and establishing a benchmark for future FSOL research.

Abstract

Existing object localization methods are tailored to locate specific classes of objects, relying heavily on abundant labeled data for model optimization. However, acquiring large amounts of labeled data is challenging in many real-world scenarios, significantly limiting the broader application of localization models. To bridge this research gap, this paper defines a novel task named Few-Shot Object Localization (FSOL), which aims to achieve precise localization with limited samples. This task achieves generalized object localization by leveraging a small number of labeled support samples to query the positional information of objects within corresponding images. To advance this field, we design an innovative high-performance baseline model. This model integrates a dual-path feature augmentation module to enhance shape association and gradient differences between supports and query images, alongside a self query module to explore the association between feature maps and query images. Experimental results demonstrate a significant performance improvement of our approach in the FSOL task, establishing an efficient benchmark for further research. All code and data are available at https://github.com/Ryh1218/FSOL.

Paper Structure

This paper contains 31 sections, 13 equations, 7 figures, 6 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Demonstration of the Few-Shot Object Localization (FSOL) task. During the training phase, the model predicts the location map based on given support samples and their corresponding query image. It then adjusts its parameters by minimizing the Mean Squared Error loss between the ground truth and the predicted location map. In the testing phase, the trained model predicts the location map of novel-class samples on corresponding query images that did not appear in the training phase.
  • Figure 2: Difficulties that degrade few-shot object localization: a) Appearance gap between intra-class objects; b) Object omission due to inter-object occlusion.
  • Figure 3: Demonstration of our FSOL Pipeline. Given the query and support images, the query feature $F_Q$ is extracted from the query image while the support feature $F_S$ is cropped from $F_Q$. The Dual-path Feature Augmentation (DFA) module first enhances deformation and gradient information in both $F_Q$ and $F_S$ through deformation and gradient branches, outputting the deformation-enhanced $F_Q^D$ and $F_S^D$ as well as the gradient-enhanced $F_Q^C$ and $F_S^C$. DFA then performs 3D convolution on the stacked $F_Q^D$ and $F_S^D$ using the stacked $F_Q^C$ and $F_S^C$ as convolution kernel weights, obtaining the similarity map $S$ between query and support images. Next, the Self Query (SQ) module accepts $S$ as input and uses the original $F_Q$ to guide the object’s distribution information in $S$, outputting the optimized similarity map $S_{SQ}$. Finally, $S_{SQ}$ is sent to the regression head to obtain the final location map.
  • Figure 4: Demonstration of two convolutional strategies for enhancing deformation and gradient information in support and query images: (a) Deformable Conv: Vanilla convolution utilizes fixed sampling points, potentially introducing noise, while deformable convolution adjusts sampling points, reducing background noise and improving adaptability; (b) CCD-Conv: Cross-center difference convolution computes differences between neighboring pixels around the central pixel and uses these differences as weights to generate the final output. This approach captures subtle image changes like texture, edges, and fine details.
  • Figure 5: Demonstration of Self Query (SQ) module. The SQ module enhances the model's perception of the object distribution by integrating information from the similarity map $S$ and the original query image features $F_Q$. Initially, it applies a shared convolution layer to both $S$ and $F_Q$, thereby introducing non-linearity and capturing similar patterns. Next, it calculates the cosine similarity between $S$ and $F_Q$ to obtain self query weights $W$. These weights are added element-wise to $S$, enabling the distribution information from $S$ to guide optimization through $F_Q$. Finally, after passing through another convolution layer, the SQ module generates the optimized similarity map $S_{SQ}$.
  • ...and 2 more figures
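
The four steps of the SQ module described in the Figure 5 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `conv1x1` and `self_query`, the use of 1x1 convolutions, the ReLU non-linearity, the assumption that $S$ and $F_Q$ share the same channel count and spatial size, and the choice to add $W$ to the original $S$ (broadcast over channels) are all assumptions made for the example.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise (1x1) convolution: mixes channels at every pixel.
    x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W).
    A 1x1 kernel is an assumption; the caption does not give the kernel size."""
    return np.einsum("oc,chw->ohw", weight, x)

def self_query(S, F_Q, w_shared, w_out, eps=1e-8):
    """Hypothetical sketch of the Self Query steps from the Figure 5 caption."""
    # Step 1: the same (shared) convolution is applied to S and F_Q.
    # The ReLU is an assumed choice of non-linearity.
    s = np.maximum(conv1x1(S, w_shared), 0.0)
    q = np.maximum(conv1x1(F_Q, w_shared), 0.0)

    # Step 2: cosine similarity along the channel axis gives one
    # self query weight per spatial location, shape (H, W).
    dot = (s * q).sum(axis=0)
    norm = np.linalg.norm(s, axis=0) * np.linalg.norm(q, axis=0)
    W = dot / (norm + eps)

    # Step 3: add the weights element-wise to S (broadcast over channels).
    S_aug = S + W[None, :, :]

    # Step 4: a final convolution produces the optimized similarity map.
    S_SQ = conv1x1(S_aug, w_out)
    return S_SQ, W

# Usage with random tensors; all sizes are arbitrary.
rng = np.random.default_rng(0)
C, H, Wd = 4, 6, 6
S = rng.standard_normal((C, H, Wd))
F_Q = rng.standard_normal((C, H, Wd))
w_shared = rng.standard_normal((C, C))
w_out = rng.standard_normal((C, C))

S_SQ, W = self_query(S, F_Q, w_shared, w_out)
print(S_SQ.shape, W.shape)
```

Because both inputs pass through a ReLU in this sketch, the cosine weights fall in $[0, 1]$, so the added term can only raise the response at locations where $S$ and $F_Q$ agree. Whether the paper's module behaves this way depends on details the caption does not specify.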