Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection
Francesco Barbato, Umberto Michieli, Jijoong Moon, Pietro Zanuttigh, Mete Ozay
TL;DR
This work tackles on-device Few-Shot Instance-level Personalized Object Detection (FS-IPOD) by proposing AuXFT, which constructs an auxiliary feature space to distill knowledge from a self-supervised oracle into a detector without degrading its base performance. A Translator Block with Channel Differential and Spatial Differential aligns CNN features to self-supervised (SSL) features, enabling distillation into an auxiliary space, while Detection-Driven Feature Pooling creates per-detection embeddings for a conditional prototypical Few-Shot Learner. Empirical results across PerSeg, POD, iCubWorld, and CORe50 show substantial mAP gains with modest computational overhead: AuXFT reaches about 80% of oracle performance at about 32% of the inference time, 13% of the VRAM, and 19% of the model size, highlighting strong potential for practical on-device personalization. The approach demonstrates robust cross-architecture knowledge transfer and a scalable path for user-specific object personalization on resource-constrained robots and devices.
Abstract
In recent years, robotic object detection systems have been deployed in many personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design: they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. the user's dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). Personalization typically requires many samples for model tuning and optimization on a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications. In our work, we tackle both problems at the same time by designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, and show that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with an excellent time-complexity trade-off: AuXFT reaches 80% of its upper-bound performance at just 32% of the inference time, 13% of the VRAM, and 19% of the model size.
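To make the coarse-to-fine idea concrete, the following is a minimal sketch of a conditional nearest-prototype classifier. It is an illustration under stated assumptions, not the AuXFT implementation: the class name `ConditionalPrototypeLearner`, its methods, the use of cosine similarity, and the toy 3-dimensional embeddings are all hypothetical. The only idea taken from the abstract is that the detector's coarse class conditions which personal prototypes a detection is compared against.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class ConditionalPrototypeLearner:
    """Hypothetical conditional nearest-prototype classifier (illustrative only).

    Few-shot samples are grouped by the detector's coarse class, so a query is
    compared only against personal instances that share its coarse class.
    """

    def __init__(self):
        # coarse class -> personal label -> list of unit-norm embeddings
        self._shots = defaultdict(lambda: defaultdict(list))

    def add_shot(self, coarse_class, personal_label, embedding):
        """Register one labeled few-shot embedding for a personal instance."""
        self._shots[coarse_class][personal_label].append(l2_normalize(embedding))

    def prototypes(self, coarse_class):
        """Return {personal_label: unit-norm mean embedding} for one coarse class."""
        protos = {}
        for label, shots in self._shots.get(coarse_class, {}).items():
            dim = len(shots[0])
            mean = [sum(s[i] for s in shots) / len(shots) for i in range(dim)]
            protos[label] = l2_normalize(mean)
        return protos

    def predict(self, coarse_class, embedding):
        """Refine a coarse detection into a personal label.

        Falls back to the coarse class when no prototype exists for it.
        """
        protos = self.prototypes(coarse_class)
        if not protos:
            return coarse_class
        query = l2_normalize(embedding)

        def cosine(proto):
            # Both vectors are unit-norm, so the dot product is the cosine.
            return sum(a * b for a, b in zip(query, proto))

        return max(protos, key=lambda label: cosine(protos[label]))


# Toy usage with made-up embeddings (assumption: one embedding per detection).
learner = ConditionalPrototypeLearner()
learner.add_shot("dog", "my_dog", [1.0, 0.0, 0.1])
learner.add_shot("dog", "neighbor_dog", [0.0, 1.0, 0.1])
learner.add_shot("cup", "my_cup", [1.0, 0.0, 0.0])

refined = learner.predict("dog", [0.9, 0.1, 0.0])
unseen = learner.predict("cat", [0.9, 0.1, 0.0])
print(refined, unseen)
```

In this sketch a detection whose coarse class has no registered personal instance keeps its coarse label, so the base detector's output is never overridden without evidence. A practical system would likely also need a rejection rule for objects that share a coarse class with a personal instance but are not that instance; this sketch omits it.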
