Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection

Francesco Barbato, Umberto Michieli, Jijoong Moon, Pietro Zanuttigh, Mete Ozay

TL;DR

This work tackles on-device Few-Shot Instance-level Personalized Object Detection (FS-IPOD) by proposing AuXFT, which constructs an auxiliary feature space to distill knowledge from a self-supervised oracle into a detector without degrading its base performance. A Translator Block with Channel Differential and Spatial Differential aligns CNN features to SSL features, enabling distillation into an auxiliary space, while Detection-Driven Feature Pooling creates per-detection embeddings for a conditional prototypical Few-Shot Learner. Empirical results across PerSeg, POD, iCubWorld, and CORe50 show substantial mAP gains with modest computational overhead, achieving up to ~80% of oracle performance at ~32% of the inference time, ~13% of the VRAM, and ~19% of the model size, highlighting strong potential for practical on-device personalization. The approach demonstrates robust cross-architecture knowledge transfer and a scalable path for user-specific object personalization on resource-constrained robots and devices.

Abstract

Recent years have seen robotic object detection systems deployed in several personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design: they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. the user's dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). The personalization task requires many samples for model tuning and optimization in a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications. In our work, we tackle both problems at the same time by designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, showing that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with an excellent time-complexity trade-off: AuXFT reaches 80% of the performance of its upper bound at just 32% of the inference time, 13% of the VRAM, and 19% of the model size.
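The conditional coarse-to-fine idea in the abstract can be illustrated with a small prototypical classifier. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (`fit_prototypes`, `refine`), the use of cosine similarity, and the `min_sim` fallback threshold are all hypothetical choices made for illustration. Each personal instance gets a prototype (the mean of its few support embeddings), and a detection is only compared against prototypes registered under its own coarse class.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def fit_prototypes(support):
    """Build one mean-embedding prototype per personal instance.

    support: list of (coarse_class, instance_name, embedding) tuples.
    Returns {coarse_class: {instance_name: prototype}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for coarse, instance, emb in support:
        grouped[coarse][instance].append(emb)
    return {
        coarse: {
            name: [sum(col) / len(embs) for col in zip(*embs)]
            for name, embs in instances.items()
        }
        for coarse, instances in grouped.items()
    }


def refine(coarse, embedding, prototypes, min_sim=0.5):
    """Refine a coarse detection label to a personal instance.

    Only prototypes stored under the same coarse class are candidates
    (the "conditional" part). The min_sim threshold is an assumption of
    this sketch: below it, the coarse label is kept unchanged.
    """
    best_name, best_sim = None, -1.0
    for name, proto in prototypes.get(coarse, {}).items():
        sim = cosine(embedding, proto)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_name is not None and best_sim >= min_sim:
        return best_name
    return coarse
```

In this sketch, a detection labeled `dog` by the detector is relabeled `my_dog` only if its embedding is close enough to the `my_dog` prototype; detections of classes with no registered personal instance pass through untouched, so the base detector's output is never overwritten by unrelated prototypes.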

Paper Structure (14 sections, 5 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: We explore few-shot instance-level personalized object detection in constrained robotic applications. Top-left: standard pre-trained models suffer from neural collapse, hence the FSL personalization fails. Top-right: naïve knowledge distillation degrades detection performance significantly. Bottom-left: Foundation Model-driven approaches use compute-heavy vision-text architectures as guidance for feature pooling. Bottom-right (ours): we create an auxiliary feature space where teacher knowledge is distilled for FSL personalization, without impacting detection performance.
  • Figure 2: Overview of our AuXFT. The descriptive features produced by the oracle network are distilled in the auxiliary space through the Translator Block during training. At inference time, the oracle is discarded. For personalization, the predicted boxes and features pass through the DDFP, whose output is fed to the FSL. User input is only needed for FSL training.
  • Figure 3: Auxiliary Features visualization. To generate the colors, we fit an embedding of the oracle features $\mathbf{O}$ into $[0,1]^3$ (RGB space) during training; the same mapping is then used to color $\mathbf{R}_i$. The cosine similarity is computed pointwise.
  • Figure 4: Personalization performance varying the ground truth boxes padding size on PerSeg.
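The captions of Figures 2 and 4 refer to pooling features inside predicted (or padded ground-truth) boxes to obtain one embedding per detection. The exact pooling operator used by the DDFP is not described in this section, so the following is only a rough sketch that assumes plain average pooling over the feature-map cells covered by the box. The function name `pool_box_features` and the `stride` and `pad` parameters are hypothetical names introduced for illustration.

```python
import math


def pool_box_features(feature_map, box, stride=1, pad=0):
    """Average-pool the feature vectors covered by a (padded) box.

    feature_map: H x W grid (list of rows) of C-dimensional vectors.
    box: (x0, y0, x1, y1) in image pixels.
    stride: image pixels per feature-map cell (assumed uniform).
    pad: pixels added on every side of the box before pooling.
    Average pooling is an assumption of this sketch.
    """
    height, width = len(feature_map), len(feature_map[0])
    x0, y0, x1, y1 = box
    # Expand the box, convert to cell indices, and clip to the grid.
    c0 = min(max(0, math.floor((x0 - pad) / stride)), width - 1)
    r0 = min(max(0, math.floor((y0 - pad) / stride)), height - 1)
    # Always keep at least one cell so the mean is well defined.
    c1 = max(c0 + 1, min(width, math.ceil((x1 + pad) / stride)))
    r1 = max(r0 + 1, min(height, math.ceil((y1 + pad) / stride)))
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    # Channel-wise mean over all selected cells.
    return [sum(channel) / len(cells) for channel in zip(*cells)]
```

Increasing `pad` widens the pooled region and mixes more background context into the embedding, which is the kind of effect a padding-size study such as the one in Figure 4 would measure.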