Mutually-Aware Feature Learning for Few-Shot Object Counting
Yerim Jeon, Subeen Lee, Jihwan Kim, Jae-Pil Heo
TL;DR
Few-shot object counting often suffers from target confusion when multiple classes appear in a scene. The authors propose Mutually-Aware FEAture learning (MAFEA), which models mutual relations between query and exemplar features from the early stages of feature extraction via self- and cross-attention. A learnable background token and a Target-Background Discriminative (TBD) loss further separate background cues from target cues. The approach achieves state-of-the-art performance on the FSCD-LVIS and FSC-147 benchmarks and generalizes well across datasets to CARPK, and ablations confirm the contributions of mutual relation modeling, the background token, and the TBD loss. Overall, MAFEA enables more accurate target-aware counting in complex multi-class scenes, addressing a key limitation of extract-and-match counting methods.
Abstract
Few-shot object counting has garnered significant attention for its practicality, as it aims to count target objects in a query image based on given exemplars without additional training. However, the prevailing extract-and-match approach has a shortcoming: query and exemplar features lack interaction during feature extraction, since they are extracted independently and only later correlated based on similarity. This can lead to insufficient target awareness and confusion in identifying the actual target when objects of multiple classes coexist. To address this, we propose a novel framework, Mutually-Aware FEAture learning (MAFEA), which encodes query and exemplar features with mutual awareness from the outset. By encouraging interaction throughout the pipeline, we obtain target-aware features that are robust to multi-category scenarios. Furthermore, we introduce a background token to effectively associate the query's target region with the exemplars and decouple its background region. Our extensive experiments demonstrate that our model achieves state-of-the-art performance on the FSCD-LVIS and FSC-147 benchmarks with remarkably reduced target confusion.
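To make the idea of mutual awareness with a background token more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention step in which query tokens attend jointly to themselves (self-attention), to exemplar tokens (cross-attention), and to one extra background token. The function name `mutual_attention`, the absence of learned projections, and the returned attention masses are all illustrative assumptions; in the actual model the background token would be a learned parameter.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mutual_attention(query_tokens, exemplar_tokens, background_token):
    """One hypothetical attention step (no learned projections).

    query_tokens:     (Nq, d) features of query-image patches
    exemplar_tokens:  (Ne, d) features of exemplar patches
    background_token: (d,)    stand-in for a learnable background token
    """
    n_q, d = query_tokens.shape
    n_e = exemplar_tokens.shape[0]

    # Keys/values: query tokens, exemplar tokens, and the background token.
    context = np.concatenate(
        [query_tokens, exemplar_tokens, background_token[None, :]], axis=0
    )

    # Scaled dot-product attention from every query token to the context.
    scores = query_tokens @ context.T / np.sqrt(d)   # (Nq, Nq + Ne + 1)
    weights = softmax(scores, axis=-1)
    updated = weights @ context                      # (Nq, d)

    # Attention mass each query token places on exemplars vs. background.
    exemplar_mass = weights[:, n_q:n_q + n_e].sum(axis=1)
    background_mass = weights[:, -1]
    return updated, exemplar_mass, background_mass


# Toy usage: one query token resembles the exemplar, the other the background.
queries = np.array([[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]])
exemplars = np.array([[5.0, 0.0, 0.0, 0.0]])
background = np.array([0.0, 5.0, 0.0, 0.0])
updated, ex_mass, bg_mass = mutual_attention(queries, exemplars, background)
```

In this toy setup, the first query token places more attention on the exemplar than on the background token, and the second does the opposite. This is the kind of separation that a loss such as the TBD loss would presumably encourage during training.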
