Mutually-Aware Feature Learning for Few-Shot Object Counting

Yerim Jeon; Subeen Lee; Jihwan Kim; Jae-Pil Heo

TL;DR

Few-shot object counting often suffers from target confusion when objects of multiple classes appear in a scene. The authors propose Mutually-Aware FEAture learning (MAFEA), which models mutual relations between query and exemplar features from an early stage of feature extraction via self- and cross-attention. A learnable background token and a Target-Background Discriminative (TBD) loss separate background cues from target cues. The approach achieves state-of-the-art performance on the FSCD-LVIS and FSC-147 benchmarks and generalizes well across datasets to CARPK; ablations support the contributions of mutual relation modeling, the background token, and the TBD loss. Overall, MAFEA counts targets more accurately in complex multi-class scenes, addressing a key limitation of extract-and-match counting methods.

Abstract

Few-shot object counting has garnered significant attention for its practicality: it aims to count target objects in a query image based on given exemplars, without additional training. However, the prevailing extract-and-match approach has a shortcoming: query and exemplar features do not interact during feature extraction, since they are extracted independently and only later correlated by similarity. This can lead to insufficient target awareness and to confusion in identifying the actual target when objects of multiple classes coexist. To address this, we propose a novel framework, Mutually-Aware FEAture learning (MAFEA), which encodes query and exemplar features with mutual awareness from the outset. By encouraging interaction throughout the pipeline, we obtain target-aware features that are robust in multi-category scenarios. Furthermore, we introduce a background token to effectively associate the query's target region with the exemplars and decouple its background region. Our extensive experiments demonstrate that our model achieves state-of-the-art performance on the FSCD-LVIS and FSC-147 benchmarks with remarkably reduced target confusion.
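The mutual-awareness idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, one set of projection matrices shared by all tokens, and a single joint sequence made of query tokens, exemplar tokens, and one background token. Attention over that joint sequence covers self-relations and cross-relations in both directions in one step. All names (`mutual_attention`, `w_q`, `w_k`, `w_v`) and all sizes are hypothetical, and the random values stand in for learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mutual_attention(query_tokens, exemplar_tokens, background_token, w_q, w_k, w_v):
    """One single-head attention step over the joint token sequence.

    query_tokens:     (n_query, dim) patch features of the query image
    exemplar_tokens:  (n_exemplar, dim) features of the exemplar crops
    background_token: (dim,) vector (learnable in a real model; random here)
    """
    n_query = query_tokens.shape[0]
    # Joint sequence: every token can attend to every other token, so query
    # and exemplar features are updated with awareness of each other.
    tokens = np.concatenate(
        [query_tokens, exemplar_tokens, background_token[None, :]], axis=0
    )
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    updated = tokens + attn @ v  # residual connection (an assumption)
    return updated[:n_query], updated[n_query:-1], updated[-1], attn


# Toy inputs: 16 query patches, 3 exemplars, feature dimension 8.
rng = np.random.default_rng(0)
dim = 8
query_tokens = rng.normal(size=(16, dim))
exemplar_tokens = rng.normal(size=(3, dim))
background_token = rng.normal(size=(dim,))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

new_query, new_exemplars, new_background, attn = mutual_attention(
    query_tokens, exemplar_tokens, background_token, w_q, w_k, w_v
)
```

In this sketch, the last column of `attn` holds the weight each token places on the background token, which gives background query patches something to attend to other than the exemplars.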

Paper Structure (31 sections, 11 equations, 9 figures, 9 tables)

Introduction
Related Work
Class-Specific Object Counting
Few-Shot and Zero-Shot Object Counting
Overall Pipeline
Mutual Relation Modeling
Background Token
Target-Background Discriminative Loss
Training Loss
Experiments
Implementation Details
Architecture.
Training details.
Datasets and Metrics.
Datasets.
...and 16 more sections

Figures (9)

Figure 1: Target confusion problem in a multi-class scenario. Each box in the query image is a box annotation of an exemplar. While SAFECount and LOCA count all objects in the query image regardless of the given exemplar images, MAFEA accurately distinguishes target objects based on the exemplars.
Figure 2: Comparison between Extract-and-Match methods and our proposed MAFEA. (a) Existing methods extract query and exemplar features without any explicit feedback to each other. (b) On the other hand, MAFEA produces the query and exemplar features based on their mutual relation from an early stage of the feature extractor. By integrating self-relations and bi-directional co-relations, MAFEA produces highly target-aware features. Moreover, the learnable background token is fed into the self- and co-relations with the exemplar features to represent the background regions of the query image. (c) Self- and Cross-Relations are implemented by self- and cross-attention mechanisms.
Figure 3: Qualitative results. The 1st and 2nd rows are from the FSCD-LVIS dataset, and the 3rd and 4th rows are from the FSC-147 dataset. Each box in the query image is a box annotation for an exemplar, while the numbers in the pictures are the counting results. Best viewed with zoom-in.
Figure 4: Qualitative results for a multi-class scenario. Each box in the query image is a box annotation for an exemplar, while the numbers in the pictures are the counting results. Best viewed with zoom-in.
Figure 5: Results in multi-class scenes within the FSC-147 dataset. From left to right: query image, target region map, ground-truth density map, and our predictions over the whole image, the target region, and the non-target region. Each box in the query image is a box annotation for an exemplar image, while the numbers in the images are the counting results.
...and 4 more figures
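The per-region counts described in the Figure 5 caption can be illustrated with a small sketch. It assumes the count is read out as the sum of the predicted density map, as is usual in density-based counting, and that the target region is given as a binary mask. The arrays below are toy values, not data from the paper, and the function name `region_counts` is hypothetical.

```python
import numpy as np


def region_counts(density_map, target_mask):
    """Split a density-map count into whole-image, target, and non-target parts."""
    total = float(density_map.sum())
    target = float(density_map[target_mask].sum())
    non_target = float(density_map[~target_mask].sum())
    return total, target, non_target


# Toy 4x4 prediction: one full object inside the target region (top half)
# and half an object's worth of spurious density outside it.
density_map = np.zeros((4, 4))
density_map[0:2, 0:2] = 0.25   # sums to 1.0
density_map[2:4, 2:4] = 0.125  # sums to 0.5

target_mask = np.zeros((4, 4), dtype=bool)
target_mask[0:2, :] = True

total, target, non_target = region_counts(density_map, target_mask)
print(total, target, non_target)  # 1.5 1.0 0.5
```

A low non-target count relative to the total indicates that little density is being assigned to objects outside the target region, which is the kind of target confusion the figure is meant to expose.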
