Camera clustering for scalable stream-based active distillation
Dani Manjah, Davide Cacciarelli, Christophe De Vleeschouwer, Benoit Macq
TL;DR
The paper tackles scalable object detection in city-scale video streams by combining self-training and knowledge distillation with camera-domain clustering. It introduces CSBAD, which clusters camera streams based on cross-domain transfer, trains $K$ cluster-specific lightweight Student models via a Teaching Server that pseudo-labels data with a universal Teacher $\boldsymbol{\Theta}$, and uses a Top-Confidence SELECT strategy to pick high-confidence frames. On the WALT dataset, CSBAD improves $\mathrm{mAP}_{50\text{-}95}$ compared with per-camera and universal baselines, and shows that top-confidence pseudo-labels reduce confirmation bias. The work provides design guidelines and discusses scalability, continuous deployment, and resource management for practical use in large-scale video analytics.
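As a rough illustration of the Top-Confidence SELECT idea, the sketch below ranks frames by a confidence score and keeps the best ones within a budget. The function name `select_top_confidence`, the `budget` parameter, and the choice of the maximum detection score as the frame-level confidence are assumptions made for this example; the summary does not specify how the paper aggregates per-detection scores.

```python
from typing import Dict, List


def select_top_confidence(frame_scores: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the ids of the `budget` frames with the highest confidence.

    `frame_scores` maps a frame id to the detection confidences a model
    produced on that frame.
    """

    def frame_confidence(scores: List[float]) -> float:
        # Assumption: frame-level confidence is the maximum detection score,
        # and a frame with no detections gets 0.0.
        return max(scores) if scores else 0.0

    ranked = sorted(
        frame_scores,
        key=lambda frame_id: frame_confidence(frame_scores[frame_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical detection confidences for four frames of one stream.
stream = {
    "frame_001": [0.91, 0.40],
    "frame_002": [0.55],
    "frame_003": [],
    "frame_004": [0.97, 0.88, 0.12],
}
selected = select_top_confidence(stream, budget=2)
print(selected)
```

In a pipeline like the one described, the selected frames would then be sent to the Teaching Server for pseudo-labeling by the Teacher.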
Abstract
We present a scalable framework designed to craft efficient lightweight models for video object detection utilizing self-training and knowledge distillation techniques. We scrutinize methodologies for the ideal selection of training images from video streams and the efficacy of model sharing across numerous cameras. By advocating for a camera clustering methodology, we aim to diminish the requisite number of models for training while augmenting the distillation dataset. The findings affirm that proper camera clustering notably amplifies the accuracy of distilled models, eclipsing the methodologies that employ distinct models for each camera or a universal model trained on the aggregate camera data.
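One possible way to picture the camera clustering step is to group cameras whose models transfer well to each other. The sketch below assumes a hypothetical matrix `transfer[i][j]` holding the score (for example, an mAP value in [0, 1]) of a model trained on camera `i` and evaluated on camera `j`, and applies plain average-linkage agglomerative clustering. Both the distance definition and the clustering algorithm are assumptions for illustration, not necessarily the authors' exact procedure.

```python
from itertools import combinations
from typing import List


def cluster_cameras(transfer: List[List[float]], k: int) -> List[List[int]]:
    """Group camera indices into `k` clusters by mutual transfer score."""

    def dist(i: int, j: int) -> float:
        # Assumption: symmetrize the transfer scores; high mutual transfer
        # means the two camera domains are close.
        return 1.0 - 0.5 * (transfer[i][j] + transfer[j][i])

    # Start with one cluster per camera and merge until k clusters remain.
    clusters = [[i] for i in range(len(transfer))]
    while len(clusters) > k:

        def linkage(pair):
            # Average pairwise distance between the members of two clusters.
            a, b = pair
            total = sum(dist(i, j) for i in clusters[a] for j in clusters[b])
            return total / (len(clusters[a]) * len(clusters[b]))

        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical cross-domain scores for four cameras.
transfer = [
    [0.60, 0.55, 0.20, 0.15],
    [0.50, 0.62, 0.18, 0.22],
    [0.21, 0.17, 0.58, 0.52],
    [0.14, 0.20, 0.54, 0.61],
]
groups = cluster_cameras(transfer, k=2)
print(groups)
```

Each resulting group would share one Student model, which reduces the number of models to train while enlarging the distillation dataset available to each of them.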
