Smooth and Stepwise Self-Distillation for Object Detection
Jieren Deng, Xin Zhou, Hao Tian, Zhihong Pan, Derek Aguiar
TL;DR
The paper tackles limitations of pre-trained teacher-based distillation in object detection by proposing Smooth and Stepwise Self-Distillation (SSSD), which uses a Jensen-Shannon divergence-based distillation loss and an adaptive, stepwise distillation coefficient tied to the learning-rate schedule. By constructing an implicit teacher from label annotations and backbone features within a feature pyramid network, SSSD distills information through a smooth, bounded loss $L_{distill} = \frac{1}{N} \sum^{P}_{p=1} \left( D_{JS}(\bm{k}^p \,\|\, \bm{k}^{p}_e) \right)^{\frac{1}{2}}$ and combines it with the detection objective as $L_{total} = L_{det} + \lambda L_{distill}$; the stepwise strategy changes $\lambda$ at 120,000 iterations to sustain distillation impact during LR decay. Extensive COCO benchmarks across backbones (e.g., ResNet-50/101) and detectors (Faster R-CNN, RetinaNet, FCOS) show that SSSD generally yields higher AP than LabelEnc and LGD, with notable gains for small and medium objects and robustness to different $\lambda$ values. The stepwise distillation variant further boosts AP by about 0.5% over fixed-$\lambda$ configurations, underscoring the practical value of adapting distillation strength during training. Overall, SSSD advances self-distillation for object detection by delivering consistent improvements without large teachers and by leveraging a smooth JS-based loss plus a principled stepwise coefficient schedule.
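To make the distillation loss concrete, the following is a minimal Python sketch of $L_{distill}$ as written above: the square root of the Jensen-Shannon divergence, summed over $P$ pyramid levels and divided by $N$. Several details are assumptions for illustration and are not taken from the paper: each feature map is given as a flat list of non-negative values, it is turned into a probability distribution by simple normalization (the hypothetical helper `_normalize`), and $N$ is passed in as a plain argument `n` because its definition is not given in this summary.

```python
import math


def _normalize(values, eps=1e-12):
    """Hypothetical helper: turn non-negative scores into a probability distribution."""
    total = sum(values) + eps * len(values)
    return [(v + eps) / total for v in values]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    # Terms with zero probability contribute nothing to the KL sums.
    kl_pm = sum(a * math.log(a / c) for a, c in zip(p, m) if a > 0)
    kl_qm = sum(b * math.log(b / c) for b, c in zip(q, m) if b > 0)
    return 0.5 * (kl_pm + kl_qm)


def distill_loss(student_maps, teacher_maps, n):
    """Sketch of L_distill = (1/N) * sum_p sqrt(D_JS(k^p || k_e^p)).

    student_maps / teacher_maps: one flat list of non-negative values per
    pyramid level (an assumed input format). n: the normalizer N.
    """
    total = 0.0
    for k_p, k_e_p in zip(student_maps, teacher_maps):
        p = _normalize(k_p)
        q = _normalize(k_e_p)
        # Clamp at zero to guard against tiny negative rounding errors.
        total += math.sqrt(max(js_divergence(p, q), 0.0))
    return total / n
```

With natural logarithms the JS divergence never exceeds $\ln 2$, so each per-level term is at most $\sqrt{\ln 2}$, which is the bounded behavior the summary refers to.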
Abstract
Distilling the structured information captured in feature maps has contributed to improved results for object detection tasks, but requires careful selection of baseline architectures and substantial pre-training. Self-distillation addresses these limitations and has recently achieved state-of-the-art performance for object detection despite making several simplifying architectural assumptions. Building on this work, we propose Smooth and Stepwise Self-Distillation (SSSD) for object detection. Our SSSD architecture forms an implicit teacher from object labels and a feature pyramid network backbone to distill label-annotated feature maps using Jensen-Shannon distance, which is smoother than distillation losses used in prior work. We additionally introduce a distillation coefficient that is adaptively configured based on the learning rate. We extensively benchmark SSSD against a baseline and two state-of-the-art object detector architectures on the COCO dataset by varying the coefficients and backbone and detector networks. We demonstrate that SSSD achieves higher average precision in most experimental settings, is robust to a wide range of coefficients, and benefits from our stepwise distillation procedure.
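The stepwise coefficient can be illustrated with a short Python sketch. It assumes a single switch point at 120,000 iterations, as stated in the TL;DR, and combines the losses as $L_{total} = L_{det} + \lambda L_{distill}$. The function names, the two coefficient arguments `lambda_early` and `lambda_late`, and the choice to apply the new value starting exactly at the switch iteration are hypothetical; the actual coefficient values are not given in this summary.

```python
def stepwise_lambda(iteration, lambda_early, lambda_late, switch_iter=120_000):
    """Return the distillation coefficient for a given training iteration.

    lambda_early / lambda_late are placeholder values chosen by the caller;
    switch_iter marks where the coefficient changes (assumed inclusive).
    """
    return lambda_early if iteration < switch_iter else lambda_late


def total_loss(det_loss, distill_loss_value, iteration, lambda_early, lambda_late):
    """Sketch of L_total = L_det + lambda * L_distill with a stepwise lambda."""
    lam = stepwise_lambda(iteration, lambda_early, lambda_late)
    return det_loss + lam * distill_loss_value
```

For example, `total_loss(1.0, 0.5, 119_999, 1.0, 2.0)` uses the early coefficient, while the same call at iteration `120_000` uses the late one.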
