Pixel-Wise Contrastive Distillation
Junqiang Huang, Zichao Guo
TL;DR
Small models pre-trained with self-supervision lag on dense prediction tasks, which the paper attributes to insufficient pixel-level knowledge transfer. It introduces Pixel-Wise Contrastive Distillation (PCD), which applies a MoCo-style pixel-level contrastive loss to corresponding student and teacher pixels $(s_i,t_i)$, using a queue of negatives and a temperature $\tau$. Two components support the loss: a SpatialAdaptor that reshapes the teacher's projection head so that it can be applied to 2D feature maps, and a lightweight Multi-Head Self-Attention (MHSA) module that enlarges the student's effective receptive field. Empirically, PCD outperforms prior self-supervised distillation methods on dense tasks, achieving $37.4$ AP$^\text{bbox}$ and $34.0$ AP$^\text{mask}$ on COCO with ResNet-18-FPN and $65.1\%$ top-1 accuracy under ImageNet linear probing, while remaining robust to the choice of teacher and backbone. Overall, PCD offers a practical way to pre-train compact models for dense prediction in a self-supervised fashion, and it highlights the value of pixel-level supervision and spatially aware projection-head adaptation.
Abstract
We present a simple but effective pixel-level self-supervised distillation framework friendly to dense prediction tasks. Our method, called Pixel-Wise Contrastive Distillation (PCD), distills knowledge by attracting the corresponding pixels from the student's and teacher's output feature maps. PCD includes a novel design called SpatialAdaptor, which ``reshapes'' a part of the teacher network while preserving the distribution of its output features. Our ablation experiments suggest that this reshaping behavior enables more informative pixel-to-pixel distillation. Moreover, we utilize a plug-in multi-head self-attention module that explicitly relates the pixels of the student's feature maps to enhance the effective receptive field, leading to a more competitive student. PCD \textbf{outperforms} previous self-supervised distillation methods on various dense prediction tasks. A \mbox{ResNet-18-FPN} backbone distilled by PCD achieves $37.4$ AP$^\text{bbox}$ and $34.0$ AP$^\text{mask}$ on the COCO dataset with the \mbox{Mask R-CNN} detector. We hope our study will inspire future research on how to pre-train a small model friendly to dense prediction tasks in a self-supervised fashion.
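The idea of attracting corresponding student and teacher pixels can be illustrated with a minimal NumPy sketch of an InfoNCE-style pixel-wise loss. This is not the authors' implementation: the function names, tensor shapes, the temperature value `tau=0.2`, and the use of a fixed array as the negatives queue are all assumptions made for illustration, and the SpatialAdaptor and self-attention module are omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def pixel_contrastive_loss(student_map, teacher_map, queue, tau=0.2):
    """Hypothetical pixel-wise InfoNCE loss (illustrative only).

    student_map, teacher_map: arrays of shape (C, H, W) from the same image,
        spatially aligned so that pixel i in one matches pixel i in the other.
    queue: array of shape (K, C) holding negative features (assumed layout).
    tau: temperature; 0.2 is an assumed value, not taken from the paper.
    """
    c = student_map.shape[0]
    s = l2_normalize(student_map.reshape(c, -1).T)  # (H*W, C) student pixels
    t = l2_normalize(teacher_map.reshape(c, -1).T)  # (H*W, C) teacher pixels
    q = l2_normalize(queue)                         # (K, C) negatives

    pos = np.sum(s * t, axis=1, keepdims=True) / tau  # (H*W, 1) positive logits
    neg = (s @ q.T) / tau                             # (H*W, K) negative logits
    logits = np.concatenate([pos, neg], axis=1)

    # Numerically stable log-softmax; the positive sits in column 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())
```

Under these assumptions, the loss is averaged over all pixel positions, so a student whose feature map matches the teacher's pixel by pixel receives a lower loss than one whose features are unrelated to the teacher's.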
