Is Pre-training Applicable to the Decoder for Dense Prediction?

Chao Ning, Wanshui Gan, Weihao Xuan, Naoto Yokoya

TL;DR

This paper investigates whether decoders in encoder–decoder dense prediction networks can benefit from pre-training. The authors propose ×Net, a framework that adapts the decoder to pre-trained representations through a reversed decoder, a Re-Shape module, and a Post Feature Pyramid, together with feature mixing strategies. This enables a pre-trained encoder × pre-trained decoder collaboration without task-specific decoding tricks. Experiments on monocular depth estimation and semantic segmentation show that a pre-trained decoder enriches intermediate features with semantic information, yielding improved accuracy and sharper predictions while maintaining efficiency, and achieving state-of-the-art results on several benchmarks. The work points to a practical route for reusing large-scale pre-trained models in the decoder, potentially broadening the impact of pre-training on dense prediction tasks.

Abstract

Pre-trained encoders are widely employed in dense prediction tasks for their capability to effectively extract visual features from images. The decoder subsequently processes these features to generate pixel-level predictions. However, due to structural differences and variations in input data, only encoders benefit from pre-learned representations from vision benchmarks such as image classification and self-supervised learning, while decoders are typically trained from scratch. In this paper, we introduce $\times$Net, which facilitates a "pre-trained encoder $\times$ pre-trained decoder" collaboration through three innovative designs. $\times$Net enables the direct utilization of pre-trained models within the decoder, integrating pre-learned representations into the decoding process to enhance performance in dense prediction tasks. By simply coupling the pre-trained encoder and pre-trained decoder, $\times$Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding-specific structures or task-specific algorithms. Despite its streamlined design, $\times$Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation, achieving state-of-the-art performance particularly in monocular depth estimation.

Paper Structure

This paper contains 18 sections, 7 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Comparison of pre-trained and non-pre-trained decoders. When paired with a consistent encoder and architecture, a pre-trained decoder enhances the semantic richness of encoded feature maps and refines the final predictions.
  • Figure 2: Overview of the image classification model (left), encoder-decoder network (middle), and our $\times$Net (right). The pre-trained model represents the commonly used hierarchical structure.
  • Figure 3: The difference between $\times$Net and $\times$Net-I.
  • Figure 4: Qualitative comparisons of decoder pre-training on monocular depth estimation (left) and semantic segmentation (right). For the mixed-feature visualization, the maximum value among the RGB components is kept and the remaining values are set to zero.
  • Figure 5: Qualitative comparisons on DDAD MDE (left), NYU-Depth-V2 MDE (middle), and ADE20k semantic segmentation (right).
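
The Figure 4 caption describes its mixed-feature visualization rule only in one sentence: keep the largest of the three RGB components and zero the others. A minimal sketch of that rule, assuming the mixed feature has already been mapped to an (H, W, 3) array, is shown here. The function name `keep_max_channel`, the use of NumPy, and the tie-breaking behavior (the first maximal channel wins) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def keep_max_channel(rgb: np.ndarray) -> np.ndarray:
    """Keep the largest channel value at each pixel and zero the other two.

    rgb: array of shape (H, W, 3).
    Assumption: on ties, the first maximal channel is the one kept.
    """
    if rgb.ndim != 3 or rgb.shape[-1] != 3:
        raise ValueError("expected an array of shape (H, W, 3)")
    # Index (0, 1, or 2) of the maximal channel at every pixel: shape (H, W).
    idx = np.argmax(rgb, axis=-1)
    # One-hot boolean mask over the channel axis: shape (H, W, 3).
    mask = np.arange(3) == idx[..., None]
    # Keep the value where the mask is set; write zero everywhere else.
    return np.where(mask, rgb, 0)


if __name__ == "__main__":
    # A hypothetical 1x2 "image" with three channels per pixel.
    features = np.array([[[0.2, 0.7, 0.1], [0.9, 0.3, 0.5]]])
    print(keep_max_channel(features))
```

Each pixel in the result has at most one non-zero component, so the rendered image shows which channel dominates at each location. How the paper projects mixed features onto three channels before this step is not stated in this excerpt, so that mapping is left outside the sketch.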