Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction

Sébastien Quetin, Tapotosh Ghosh, Farhad Maleki

TL;DR

DeCon is an efficient encoder-decoder self-supervised learning (SSL) framework for joint contrastive pre-training of encoder-decoder architectures. It introduces a weighted encoder-decoder contrastive loss with non-competing objectives to make this joint pre-training possible.

Abstract

Contrastive learning methods in self-supervised settings have primarily focused on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. However, this conventional approach overlooks the potential benefits of jointly pre-training both encoder and decoder. In this paper, we propose DeCon, an efficient encoder-decoder self-supervised learning (SSL) framework that supports joint contrastive pre-training. We first extend existing SSL architectures to accommodate diverse decoders and their corresponding contrastive losses. Then, we introduce a weighted encoder-decoder contrastive loss with non-competing objectives to enable the joint pre-training of encoder-decoder architectures. By adapting a contrastive SSL framework for dense prediction, DeCon achieves state-of-the-art performance on most of the evaluated tasks when pre-trained on ImageNet-1K, COCO, and COCO+. Notably, when pre-training a ResNet-50 encoder on the COCO dataset, DeCon improves COCO object detection and instance segmentation over the baseline framework by +0.37 AP and +0.32 AP, respectively, and boosts semantic segmentation by +1.42 mIoU on Pascal VOC and by +0.50 mIoU on Cityscapes. These improvements generalize across recent backbones, decoders, datasets, and dense tasks beyond segmentation and object detection, and persist in out-of-domain scenarios, including limited-data settings, demonstrating that joint pre-training significantly enhances representation quality for dense prediction. Code is available at https://github.com/sebquetin/DeCon.git.
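
As a rough illustration, the weighted encoder-decoder contrastive loss can be written schematically as below. This is a sketch only: the symbol α and the convex-combination form are assumptions made here for illustration, since the abstract states only that the encoder and decoder terms are weighted.

$$
\mathcal{L}_{\text{total}} = \alpha\,\mathcal{L}_{\text{enc}} + (1-\alpha)\,\mathcal{L}_{\text{dec}}, \qquad \alpha \in [0, 1]
$$

Under this assumed form, setting α = 1 recovers classical encoder-only pre-training, while smaller values shift weight toward the decoder term.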

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: A: DeCon-SL. Instead of a classical encoder-only pre-training, a decoder is pre-trained alongside the encoder. We mirror the encoder loss at the decoder level and optimize the architecture using a weighted sum of encoder and decoder losses. B: DeCon-ML. Instead of computing the decoder loss at a single level, it is calculated across multiple levels (four in this figure). Additionally, a channel-wise dropout is applied at the output of each encoder level before it is passed through the skip connection to the decoder.
  • Figure 2: DeCon-ML-L and SlotCon pre-training loss dynamics.
  • Figure 3: Slots, as defined in SlotCon [slotcon_paper_ssl], learned at different outputs of our architecture. The top-left image, from the COCO validation dataset, is overlaid with the slot that was most represented in the original SlotCon encoder output feature map, resized to the image's shape. The other slots displayed are those with the largest overlap with SlotCon's main encoder output slot. More details on slot creation are available in Supplementary Material 6.
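
The two mechanisms named in the Figure 1 caption, channel-wise dropout on skip connections and a weighted sum of encoder and multi-level decoder losses, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names `channel_dropout` and `combined_loss`, the weight `alpha`, the rescaling of surviving channels, and the averaging of decoder losses across levels are all assumptions.

```python
import random


def channel_dropout(features, p, rng):
    """Zero entire channels of a feature map with probability p.

    `features` is a list of channels, each channel a list of rows (H x W).
    Surviving channels are rescaled by 1 / (1 - p), as in standard inverted
    dropout; this rescaling is an assumption, not a detail from the paper.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    dropped = []
    for channel in features:
        # One keep/drop decision per channel, shared by all spatial positions.
        factor = scale if rng.random() >= p else 0.0
        dropped.append([[value * factor for value in row] for row in channel])
    return dropped


def combined_loss(encoder_loss, decoder_losses, alpha):
    """Weighted sum of the encoder loss and the mean decoder loss.

    `decoder_losses` holds one value per decoder level (a single value for
    the single-level variant). Averaging across levels is an assumption.
    """
    if not decoder_losses:
        raise ValueError("at least one decoder loss is required")
    decoder_loss = sum(decoder_losses) / len(decoder_losses)
    return alpha * encoder_loss + (1.0 - alpha) * decoder_loss


# Toy usage: 6 channels of a 2 x 2 feature map, four decoder levels.
rng = random.Random(0)
skip_features = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(6)]
dropped_features = channel_dropout(skip_features, p=0.5, rng=rng)
total = combined_loss(1.0, [2.0, 4.0, 6.0, 8.0], alpha=0.5)
```

Dropping whole channels rather than individual activations removes an entire feature from the skip path at once, so the decoder cannot depend on any single encoder channel always being present.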