Camera clustering for scalable stream-based active distillation

Dani Manjah, Davide Cacciarelli, Christophe De Vleeschouwer, Benoit Macq

TL;DR

The paper tackles scalable object detection in city-scale video streams by combining self-training and knowledge distillation with camera-domain clustering. It introduces CSBAD, which clusters camera streams according to cross-domain transfer performance, trains $K$ cluster-specific lightweight Student models via a Teaching Server that pseudo-labels data with a universal Teacher $\boldsymbol{\Theta}$, and applies a Top-Confidence SELECT strategy to pick high-confidence frames. On the WALT dataset, CSBAD improves $mAP_{50-95}$ compared with per-camera and universal baselines, and shows that top-confidence pseudo-labels reduce confirmation bias. The work provides design guidelines and discusses scalability, continuous deployment, and resource management for large-scale video analytics.
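As a rough illustration of the Top-Confidence SELECT idea, the sketch below keeps the `budget` frames whose detections are most confident. It is a minimal sketch under stated assumptions, not the paper's implementation: scoring a frame by its single highest detection confidence is an assumption, and the names `frame_score`, `select_top_confidence`, and the toy `frames` data are hypothetical.

```python
from typing import Dict, List


def frame_score(confidences: List[float]) -> float:
    """Score a frame by its most confident detection (assumed scoring rule).

    A frame with no detections gets a score of 0.0.
    """
    return max(confidences, default=0.0)


def select_top_confidence(detections: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the ids of the `budget` highest-scoring frames, best first."""
    ranked = sorted(
        detections,
        key=lambda frame_id: frame_score(detections[frame_id]),
        reverse=True,
    )
    return ranked[:budget]


# Toy detection confidences per frame (made-up values for illustration only).
frames = {
    "f0": [0.31, 0.12],
    "f1": [0.94],
    "f2": [],
    "f3": [0.77, 0.58],
}

selected = select_top_confidence(frames, budget=2)
print(selected)
```

Here `budget` plays the role of the per-stream sample budget $B$; the selected frames would then be sent to the Teaching Server for pseudo-labeling.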

Abstract

We present a scalable framework designed to craft efficient lightweight models for video object detection utilizing self-training and knowledge distillation techniques. We scrutinize methodologies for the ideal selection of training images from video streams and the efficacy of model sharing across numerous cameras. By advocating for a camera clustering methodology, we aim to diminish the requisite number of models for training while augmenting the distillation dataset. The findings affirm that proper camera clustering notably amplifies the accuracy of distilled models, eclipsing the methodologies that employ distinct models for each camera or a universal model trained on the aggregate camera data.

Paper Structure (34 sections, 4 equations, 5 figures, 3 tables, 2 algorithms)


Figures (5)

  • Figure 1: Updates of video analytics models: Cameras (C1 to C7) send selected image data to a central server, which pseudo-labels the data, then trains and updates specialized Student models for groups of similar cameras. These updated models are then sent back to their associated cameras.
  • Figure 2: Average mAP50-95 scores for different training sample sizes, using four SELECT models. The number of epochs is 100. Key observations are: 1) Top-Confidence is the best sampling strategy and 2) fine-tuned compact models can outperform a Teacher YOLOv8x6$^{\text{COCO}}$.
  • Figure 3: Clustering definition. The heatmap depicts the cross-performance matrix $M$, where each element $M_{ij}$, $i,j=1,\cdots,9$, represents the mAP50-95 score of a model $\theta_i$ retrained on source domain $cam_i$ and evaluated on target domain $cam_j$. The setting is based on $B = 256$ images sampled using Top-Confidence. The associated dendrogram is shown alongside.
  • Figure 4: Mean mAP50-95 scores as a function of the sample budget per stream ($B$) for varying numbers of clusters ($K$). Models were trained for 100 epochs with a batch size of 16. Key findings indicate that, at constant complexity, a universal model ($K=1$) is preferable for lower $B$ values. For larger $B$ values, however, splitting the system into two or three clusters yields better results.
  • Figure 5: Mean mAP50-95 scores over log-scaled iterations ($T$) for each model. Markers denote sample sizes per stream ($B = 16, 96, 256$). Vertical lines indicate the epochs for $K = 1$. For $K > 1$, the epoch count is adjusted to keep iteration counts constant across models, following the paper's constant-complexity equation. The analysis shows that increasing the epoch count benefits the more specialized configurations with more clusters ($K = 3, 5, 9$), enabling them to match or exceed the performance of more universal configurations ($K = 1, 2$).
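
The Figure 3 caption describes grouping cameras from a cross-performance matrix $M$ and a dendrogram. The sketch below shows one way such a grouping could be computed. It is only an illustration under stated assumptions: the symmetrized dissimilarity $1 - (M_{ij} + M_{ji})/2$ and average linkage are choices made here and are not confirmed by the text above, and the function names and the 4-camera matrix are hypothetical, made-up examples.

```python
from itertools import combinations
from typing import List


def transfer_distance(M: List[List[float]]) -> List[List[float]]:
    """Symmetric dissimilarity (assumed form): small when two cameras'
    models transfer well to each other in both directions."""
    n = len(M)
    return [
        [0.0 if i == j else 1.0 - 0.5 * (M[i][j] + M[j][i]) for j in range(n)]
        for i in range(n)
    ]


def average_linkage(D: List[List[float]], a: List[int], b: List[int]) -> float:
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))


def cluster_cameras(M: List[List[float]], k: int) -> List[List[int]]:
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    if not 1 <= k <= len(M):
        raise ValueError("k must be between 1 and the number of cameras")
    D = transfer_distance(M)
    clusters = [[i] for i in range(len(M))]
    while len(clusters) > k:
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: average_linkage(D, clusters[pq[0]], clusters[pq[1]]),
        )
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up 4-camera cross-performance matrix: M[i][j] is the score of the
# model trained on camera i and evaluated on camera j.
M = [
    [0.60, 0.55, 0.20, 0.25],
    [0.50, 0.62, 0.22, 0.18],
    [0.21, 0.19, 0.58, 0.52],
    [0.24, 0.20, 0.54, 0.61],
]

groups = cluster_cameras(M, k=2)
print(groups)
```

With this toy matrix, cameras 0 and 1 transfer well to each other, as do cameras 2 and 3, so asking for two clusters groups them accordingly. Each resulting group would then share one Student model trained on the pooled frames of its cameras.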