Smooth and Stepwise Self-Distillation for Object Detection

Jieren Deng, Xin Zhou, Hao Tian, Zhihong Pan, Derek Aguiar

TL;DR

The paper tackles limitations of pre-trained teacher-based distillation in object detection by proposing Smooth and Stepwise Self-Distillation (SSSD), which uses a Jensen-Shannon divergence-based distillation loss and an adaptive, stepwise distillation coefficient tied to the learning-rate schedule. By constructing an implicit teacher from label annotations and backbone features within a feature pyramid network, SSSD distills information through a smooth, bounded loss $L_{distill} = \frac{1}{N} \sum^{P}_{p=1} \left( D_{JS}(\bm{k}^p \| \bm{k}^{p}_e) \right)^{\frac{1}{2}}$ and combines it with the detection objective as $L_{total} = L_{det} + \lambda L_{distill}$; the stepwise strategy changes $\lambda$ at 120,000 iterations to sustain distillation impact during LR decay. Extensive COCO benchmarks across backbones (e.g., ResNet-50/101) and detectors (Faster R-CNN, RetinaNet, FCOS) show that SSSD generally yields higher AP than LabelEnc and LGD, with notable gains for small and medium objects and robustness to different $\lambda$ values. The stepwise distillation variant further boosts AP by about 0.5% over fixed-$\lambda$ configurations, underscoring the practical value of adapting distillation strength during training. Overall, SSSD advances self-distillation for object detection by delivering consistent improvements without large teachers and by leveraging a smooth JS-based loss plus a principled stepwise coefficient schedule.
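The distillation loss above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the summary does not say how feature maps are turned into probability distributions or what $N$ counts, so the softmax normalization over a flattened map and the default `n = P` below are assumptions, and all function names are hypothetical.

```python
import math


def softmax(values):
    """Turn a flat list of activations into a probability distribution.

    Assumption: the source does not specify the normalization; softmax
    is used here only to obtain valid distributions.
    """
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


def distill_loss(student_levels, teacher_levels, n=None):
    """(1/N) * sum over pyramid levels p of sqrt(D_JS(k^p || k_e^p)).

    student_levels / teacher_levels: one flattened feature map per
    pyramid level. n defaults to the number of levels P (assumption).
    """
    n = len(student_levels) if n is None else n
    total = 0.0
    for k, k_e in zip(student_levels, teacher_levels):
        divergence = js_divergence(softmax(k), softmax(k_e))
        # Guard against tiny negative values from floating-point rounding.
        total += math.sqrt(max(divergence, 0.0))
    return total / n


# Toy example with P = 2 pyramid levels of 4 activations each.
student = [[0.1, 0.4, -0.3, 1.2], [2.0, 0.0, 0.5, -1.0]]
teacher = [[0.2, 0.5, -0.1, 1.0], [1.5, 0.3, 0.5, -0.8]]
loss = distill_loss(student, teacher)
```

Because the Jensen-Shannon divergence is at most $\ln 2$, each per-level term in this sketch is at most $\sqrt{\ln 2}$, which is the boundedness property the summary refers to.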

Abstract

Distilling the structured information captured in feature maps has contributed to improved results for object detection tasks, but requires careful selection of baseline architectures and substantial pre-training. Self-distillation addresses these limitations and has recently achieved state-of-the-art performance for object detection despite making several simplifying architectural assumptions. Building on this work, we propose Smooth and Stepwise Self-Distillation (SSSD) for object detection. Our SSSD architecture forms an implicit teacher from object labels and a feature pyramid network backbone to distill label-annotated feature maps using Jensen-Shannon distance, which is smoother than distillation losses used in prior work. We additionally add a distillation coefficient that is adaptively configured based on the learning rate. We extensively benchmark SSSD against a baseline and two state-of-the-art object detector architectures on the COCO dataset by varying the coefficients and backbone and detector networks. We demonstrate that SSSD achieves higher average precision in most experimental settings, is robust to a wide range of coefficients, and benefits from our stepwise distillation procedure.

Paper Structure

This paper contains 9 sections, 7 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Smooth and Stepwise Self-Distillation (SSSD). The feature maps ($\bm{K}$) extracted from the backbone (ResNet-50) are sent to the fusion component along with the ground truth annotations. The distillation loss ($L_{distill}$) is calculated using the feature maps and label-enhanced feature maps ($\bm{K_e}$). The detection loss ($L_{det}$) is calculated as classification and bounding-box regression losses by a shared detection head.
  • Figure 2: Performance comparison with different $\lambda$. After calibrating the distillation loss, the AP for SSSD with $\lambda=75$ (Ours$_{75}$) is higher than LGD configurations. The learning rates for each architecture are $0$ after iteration $17 \times 10^4$.
  • Figure 3: Stepwise self-distillation comparisons. The stepwise self-distillation strategy for both LGD and SSSD (Ours) improves final AP over a fixed $\lambda$.
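
The stepwise strategy compared in Figure 3 can be sketched as a piecewise-constant coefficient plugged into $L_{total} = L_{det} + \lambda L_{distill}$. This is a hedged illustration only: the summary gives the switch point (120,000 iterations) but not the $\lambda$ values before and after it, so `lambda_before` and `lambda_after` are hypothetical parameters, the choice to switch exactly at (rather than after) iteration 120,000 is an assumption, and the function names are invented for this example.

```python
def stepwise_lambda(iteration, lambda_before, lambda_after, switch_iter=120_000):
    """Return the distillation coefficient for a given training iteration.

    lambda_before / lambda_after are hypothetical values chosen by the
    caller; only the switch point comes from the summary above.
    """
    return lambda_before if iteration < switch_iter else lambda_after


def total_loss(l_det, l_distill, lam):
    """L_total = L_det + lambda * L_distill."""
    return l_det + lam * l_distill


# Illustrative values only; they are not taken from the paper.
early = total_loss(1.0, 0.5, stepwise_lambda(10_000, 1.0, 2.0))
late = total_loss(1.0, 0.5, stepwise_lambda(150_000, 1.0, 2.0))
```

With the same detection and distillation losses, the second call weights the distillation term more heavily, which is the kind of adjustment the stepwise schedule makes once the learning rate has decayed.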