
Pixel-Wise Contrastive Distillation

Junqiang Huang, Zichao Guo

TL;DR

The paper tackles the challenge that small models under self-supervised pretraining lag on dense prediction tasks due to insufficient pixel-level transfer. It introduces Pixel-Wise Contrastive Distillation (PCD), which uses a MoCo-style pixel-level contrastive loss over corresponding pixels $(s_i,t_i)$ with a negatives queue and temperature $\tau$, augmented by a SpatialAdaptor that reshapes the teacher projection head for 2D maps and a lightweight Multi-Head Self-Attention (MHSA) module to enlarge the effective receptive field. Empirically, PCD outperforms prior self-supervised distillation methods on dense tasks, achieving $37.4$ AP$^\text{bbox}$ and $34.0$ AP$^\text{mask}$ on COCO with ResNet-18-FPN and $65.1\%$ top-1 in ImageNet linear probing, while showing robustness to teacher choice and backbone. Overall, PCD provides a practical pathway to pre-train compact models for dense prediction in a self-supervised fashion, highlighting the value of pixel-level supervision and spatially aware projection-head adaptation.
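
To make the loss described above concrete, here is a minimal sketch of an InfoNCE-style objective over corresponding pixel pairs $(s_i,t_i)$ with a queue of negatives and temperature $\tau$. It is an illustration under assumptions, not the authors' implementation: the function names, the default `tau=0.2`, the L2 normalization of features, and the contents of the queue are all placeholders chosen for the example.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pixel_contrastive_loss(student_pixels, teacher_pixels, queue, tau=0.2):
    """Mean InfoNCE loss over corresponding pixel pairs (s_i, t_i).

    student_pixels, teacher_pixels: lists of feature vectors, where index i
    refers to the same spatial location in both feature maps.
    queue: list of negative feature vectors (hypothetical contents).
    tau: temperature; 0.2 is a placeholder, not a value taken from the paper.
    """
    negatives = [l2_normalize(q) for q in queue]
    total = 0.0
    for s, t in zip(student_pixels, teacher_pixels):
        s, t = l2_normalize(s), l2_normalize(t)
        # Positive logit first, then one logit per queued negative.
        logits = [dot(s, t) / tau] + [dot(s, q) / tau for q in negatives]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[0]
    return total / len(student_pixels)


# Toy example: two pixels with 2-dimensional features.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher_aligned = [[1.0, 0.0], [0.0, 1.0]]
teacher_swapped = [[0.0, 1.0], [1.0, 0.0]]
negatives_queue = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = pixel_contrastive_loss(student, teacher_aligned, negatives_queue)
loss_swapped = pixel_contrastive_loss(student, teacher_swapped, negatives_queue)
```

In this toy setup the loss is lower when each student pixel matches the teacher pixel at the same location than when the teacher pixels are swapped, which is the behavior a pixel-to-pixel objective is meant to reward.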

Abstract

We present a simple but effective pixel-level self-supervised distillation framework friendly to dense prediction tasks. Our method, called Pixel-Wise Contrastive Distillation (PCD), distills knowledge by attracting the corresponding pixels from the student's and teacher's output feature maps. PCD includes a novel design called SpatialAdaptor which "reshapes" a part of the teacher network while preserving the distribution of its output features. Our ablation experiments suggest that this reshaping behavior enables more informative pixel-to-pixel distillation. Moreover, we utilize a plug-in multi-head self-attention module that explicitly relates the pixels of the student's feature maps to enhance the effective receptive field, leading to a more competitive student. PCD outperforms previous self-supervised distillation methods on various dense prediction tasks. A ResNet-18-FPN backbone distilled by PCD achieves $37.4$ AP$^\text{bbox}$ and $34.0$ AP$^\text{mask}$ on the COCO dataset using Mask R-CNN as the detector. We hope our study will inspire future research on how to pre-train a small model friendly to dense prediction tasks in a self-supervised fashion.
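
The text here does not spell out how the SpatialAdaptor "reshapes" the teacher. One plausible reading, offered only as an assumption, is that a fully connected layer trained on pooled vectors is reused at every spatial location, which is what a 1x1 convolution with the same weights would compute. The sketch below illustrates that idea with plain Python lists; `linear`, `as_conv1x1`, and the toy weights are hypothetical names and values introduced for this example.

```python
def linear(weight, bias, x):
    """Fully connected layer: weight is (out x in), bias is (out), x is (in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def as_conv1x1(weight, bias, feature_map):
    """Apply the same (weight, bias) at every location of an H x W x C map.

    This is what a 1x1 convolution sharing the linear layer's parameters does,
    so no parameter is changed; only the input layout differs.
    """
    return [[linear(weight, bias, pixel) for pixel in row] for row in feature_map]


def mean_pixel(feature_map):
    """Global average pooling over the spatial dimensions."""
    pixels = [pixel for row in feature_map for pixel in row]
    channels = len(pixels[0])
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]


# Toy layer (3 input channels, 2 output channels) and a 2 x 2 feature map.
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
bias = [0.1, -0.2]
feature_map = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.0, 1.0, 0.0], [2.0, 2.0, 2.0]],
]

dense_output = as_conv1x1(weight, bias, feature_map)            # 2 x 2 x 2 map
pooled_then_linear = linear(weight, bias, mean_pixel(feature_map))
linear_then_pooled = mean_pixel(dense_output)
```

For a purely affine layer, pooling before or after the layer gives the same vector, which hints at why reusing the weights per pixel can keep the output statistics close to those of the original head. A real projection head also contains nonlinearities and possibly normalization layers, for which this exact equality no longer holds, so the example should be read as intuition only.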

Paper Structure (16 sections, 7 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Architecture of Pixel-Wise Contrastive Distillation
  • Figure 2: Workflows of Teacher's Projection Head
  • Figure 4: Effective receptive field. R50 and R18 are ResNet-50 and ResNet-18 respectively. R18 w/ MHSA stands for ResNet-18 enhanced by a multi-head self-attention module.
  • Figure 5: Visualization of feature maps. Images in the first column are from COCO val2017. The second and third columns depict the output feature maps of 'layer4' of ResNet-18 pre-trained by PCD and by the vectorized variant of PCD, respectively.