Beyond Task-Driven Features for Object Detection

Meilun Zhou, Alina Zare

Abstract

Task-driven features learned by modern object detectors optimize the end-task loss yet often capture shortcut correlations that fail to reflect the underlying annotation structure. Such representations limit transfer, interpretability, and robustness when task definitions change or supervision becomes sparse. This paper introduces an annotation-guided feature augmentation framework that injects embeddings into an object detection backbone. The method constructs dense spatial feature grids from annotation-guided latent spaces and fuses them with feature pyramid representations to influence the region proposal and detection heads. Experiments across wildlife and remote sensing datasets evaluate classification, localization, and data efficiency under multiple supervision regimes. Results show consistent improvements in object focus, reduced background sensitivity, and stronger generalization to unseen or weakly supervised tasks. The findings demonstrate that aligning features with annotation geometry yields more meaningful representations than purely task-optimized features.
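
The fusion step described in the abstract can be illustrated with a small sketch. The following is a minimal, hypothetical PyTorch module, assuming an FPN-based Faster R-CNN pipeline; the class and parameter names (AnnotationGuidedFusion, embed_dim, the learned gate) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of annotation-guided feature fusion (assumed PyTorch setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnnotationGuidedFusion(nn.Module):
    """Fuse a dense annotation-guided embedding grid with one FPN level."""

    def __init__(self, embed_dim: int, fpn_channels: int):
        super().__init__()
        # 1x1 projection aligns the embedding grid with the backbone channels.
        self.project = nn.Conv2d(embed_dim, fpn_channels, kernel_size=1)
        # Learned gate controls how strongly the guidance is injected.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, fpn_feature: torch.Tensor, embed_grid: torch.Tensor) -> torch.Tensor:
        # fpn_feature: (B, C, H, W); embed_grid: (B, D, H', W')
        guidance = self.project(embed_grid)
        # Resize the guidance grid to the spatial size of this pyramid level.
        guidance = F.interpolate(guidance, size=fpn_feature.shape[-2:],
                                 mode="bilinear", align_corners=False)
        # Residual fusion: the RPN and detection heads see the guided feature map.
        return fpn_feature + torch.sigmoid(self.gate) * guidance

# Example: fuse a 64-dim embedding grid into a 256-channel FPN level.
fusion = AnnotationGuidedFusion(embed_dim=64, fpn_channels=256)
p3 = torch.randn(2, 256, 100, 152)   # one FPN level
grid = torch.randn(2, 64, 50, 76)    # dense annotation-guided embedding grid
fused = fusion(p3, grid)             # same shape as p3
```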

Paper Structure

This paper contains 11 sections, 7 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Two-dimensional PCA projections of embeddings extracted from the Faster R-CNN backbone, CLIP, DTL, and MATL. The MATL embedding separates classes while preserving intra-class structure.
  • Figure 2: Brighter regions indicate higher values of the first principal component computed from the dense latent feature grids produced by sliding-window embedding and backbone-aligned projection. MATL concentrates activation on object-relevant regions more consistently than CLIP or DTL, which both exhibit stronger responses to background areas (see the projection sketch after this list).
  • Figure 3: In this scene, MATL correctly localizes and classifies the three deer, while both the baseline and DTL models additionally produce a false positive cow prediction. Faster R-CNN augmented with MATL also produces boxes with higher IoU against the ground truth. All predictions shown have a confidence score of at least 0.5.
  • Figure 4: Receiver operating characteristic (ROC) curves comparing detection performance using the baseline Faster R-CNN, DTL-guided features, and MATL-guided features against a chance baseline. AUC scores are provided in the legend (see the ROC sketch below).
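
Figures 1 and 2 are both built from principal component analysis of the learned embeddings. Below is a minimal sketch of the first-principal-component heatmap described for Figure 2, assuming a dense latent grid of shape (H, W, D) produced by sliding-window embedding; the function and variable names are illustrative, not the authors' code.

```python
# Minimal sketch of the first-principal-component visualization (assumed input shape).
import numpy as np
from sklearn.decomposition import PCA

def first_pc_heatmap(latent_grid: np.ndarray) -> np.ndarray:
    """Project each spatial cell of a (H, W, D) latent grid onto its first PC."""
    h, w, d = latent_grid.shape
    flat = latent_grid.reshape(-1, d)                     # (H*W, D)
    pc1 = PCA(n_components=1).fit_transform(flat)[:, 0]   # first principal component
    # Normalize to [0, 1] so brighter pixels correspond to larger first-PC values.
    pc1 = (pc1 - pc1.min()) / (pc1.ptp() + 1e-8)
    return pc1.reshape(h, w)

# Example: visualize a random 50x76 grid of 64-dim embeddings.
heatmap = first_pc_heatmap(np.random.randn(50, 76, 64))
```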
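
The ROC comparison in Figure 4 can be produced with standard tooling. The following sketch uses scikit-learn and matplotlib and assumes per-detection confidence scores paired with binary correctness labels for each model; that input format is an assumption, not a detail stated in the paper.

```python
# Minimal sketch of ROC/AUC curves for several detectors (assumed score format).
import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

def plot_roc(scores_by_model):
    """scores_by_model: {model_name: (y_true, y_score)} with binary labels."""
    for name, (y_true, y_score) in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, y_score)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()

# Example with random scores for three hypothetical models.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
plot_roc({m: (labels, rng.random(200)) for m in ["Faster R-CNN", "DTL", "MATL"]})
```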