Swiss DINO: Efficient and Versatile Vision Framework for On-device Personal Object Search

Kirill Paramonov, Jia-Xing Zhong, Umberto Michieli, Jijoong Moon, Mete Ozay

TL;DR

This work tackles on-device personal object search for robotic vision: localizing and identifying user-specific items, each referenced by only a few annotated images. It introduces Swiss DINO, a lightweight framework built on the DINOv2 backbone that uses patch-level features and prototype-based scoring to perform open-set segmentation and detection without any adaptation training. The authors formalize the problem in three stages (pre-training, on-device personalization, open-set inference) and report up to 55% higher segmentation and recognition accuracy than common lightweight solutions, together with up to 100x faster backbone inference and up to 10x lower GPU memory use than heavy transformer-based solutions. Experiments on iCubWorld and PerSEG, supported by ablations of design choices and parameter sensitivity, indicate that the approach enables accurate, resource-efficient personalized object search on home robots and mobile devices.
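
To make "patch-level features and prototype-based scoring" concrete, the following is a minimal sketch of one way such open-set scoring could work. It is an illustration under stated assumptions, not the authors' implementation: it assumes a prototype is the average of the patch features inside the annotated reference mask, that patches are scored by cosine similarity, and that a fixed threshold rejects unknown objects. All function names, the threshold value, and the synthetic features (which stand in for DINOv2 patch embeddings) are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_prototype(ref_patches, ref_mask):
    """Personalization step (assumed): average the patch features inside the mask.

    ref_patches: (H, W, D) patch features; ref_mask: (H, W) boolean object mask.
    """
    return l2_normalize(ref_patches[ref_mask].mean(axis=0))

def search(query_patches, prototypes, threshold=0.5):
    """Open-set inference (assumed): return (name or None, coarse patch mask).

    A patch is assigned to an object if its cosine similarity to that object's
    prototype exceeds `threshold`; if no patch passes for any object, the query
    is treated as containing no known personal object.
    """
    h, w, _ = query_patches.shape
    q = l2_normalize(query_patches)
    best_name, best_score = None, -np.inf
    best_mask = np.zeros((h, w), dtype=bool)
    for name, proto in prototypes.items():
        sim = q @ proto                      # (H, W) cosine similarity map
        mask = sim > threshold
        if mask.any() and sim[mask].mean() > best_score:
            best_name, best_score, best_mask = name, float(sim[mask].mean()), mask
    return best_name, best_mask

def synthetic_grid(obj_dir, box, rng, size=4, dim=8, bg_dir=1):
    """Hypothetical stand-in for backbone output: background patches point along
    axis `bg_dir`; patches inside `box` = (r0, r1, c0, c1) point along `obj_dir`."""
    feats = 0.01 * rng.standard_normal((size, size, dim))
    feats[..., bg_dir] += 1.0
    mask = np.zeros((size, size), dtype=bool)
    if box is not None:
        r0, r1, c0, c1 = box
        mask[r0:r1, c0:c1] = True
        shift = np.zeros(dim)
        shift[bg_dir], shift[obj_dir] = -1.0, 1.0
        feats[mask] += shift                 # swap background direction for object direction
    return feats, mask

rng = np.random.default_rng(0)
ref_mug, ref_mug_mask = synthetic_grid(obj_dir=0, box=(0, 2, 0, 2), rng=rng)
ref_keys, ref_keys_mask = synthetic_grid(obj_dir=2, box=(1, 3, 1, 3), rng=rng)
prototypes = {
    "mug": build_prototype(ref_mug, ref_mug_mask),
    "keys": build_prototype(ref_keys, ref_keys_mask),
}

# Query scene containing the "keys" object at a new location.
query, query_mask = synthetic_grid(obj_dir=2, box=(2, 4, 0, 2), rng=rng)
name, mask = search(query, prototypes)
print(name, int(mask.sum()))
```

In this sketch, personalization only stores one vector per object and inference is a single matrix product per prototype, which is consistent with the no-adaptation-training setting described above. The thresholded rejection is what makes the sketch open-set: a scene with an unseen object yields no match rather than a forced assignment.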

Abstract

In this paper, we address a recent trend in robotic home appliances to include vision systems on personal devices, capable of personalizing the appliances on the fly. In particular, we formulate and address an important technical task of personal object search, which involves localization and identification of personal items of interest in images captured by robotic appliances, with each item referenced only by a few annotated images. The task is crucial for robotic home appliances and mobile systems, which need to process personal visual scenes or to operate with particular personal objects (e.g., for grasping or navigation). In practice, personal object search presents two main technical challenges. First, a robot vision system needs to be able to distinguish between many fine-grained classes in the presence of occlusions and clutter. Second, the strict resource requirements for the on-device system restrict the usage of most state-of-the-art methods for few-shot learning and often prevent on-device adaptation. In this work, we propose Swiss DINO: a simple yet effective framework for one-shot personal object search based on the recent DINOv2 transformer model, which was shown to have strong zero-shot generalization properties. Swiss DINO handles challenging on-device personalized scene understanding requirements and does not require any adaptation training. We show significant improvement (up to 55%) in segmentation and recognition accuracy compared to common lightweight solutions, and significant footprint reduction of backbone inference time (up to 100x) and GPU consumption (up to 10x) compared to heavy transformer-based solutions.

Paper Structure

This paper contains 39 sections, 14 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Comparison with semantic segmentation methods. Left: common adaptive semantic segmentation methods adapt models to coarse datasets and do not account for multiple or unseen personal objects in a scene, thus generating false positive errors. Right: our Swiss DINO avoids false positive errors by performing open-set classification on parts of the image prior to generating segmentation masks.
  • Figure 2: High-level overview of our Swiss DINO system.