ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, Yifeng Shi

TL;DR

ViT-CoMer narrows the dense-prediction gap of Vision Transformers by augmenting a plain ViT backbone with CNN-derived multi-scale features. The Multi-Receptive Field Feature Pyramid (MRFP) module supplies features with diverse receptive fields, while the CNN-Transformer Bidirectional Fusion Interaction (CTI) module fuses CNN and ViT representations in both directions, all without altering the core ViT architecture. Across object detection, instance segmentation, and semantic segmentation, ViT-CoMer-L achieves results competitive with or superior to state-of-the-art backbones (64.3% AP on COCO val2017 and 62.1% mIoU on ADE20K val) and can load a wide range of open-source pre-trained weights, including multi-modal ones. The approach offers a practical path to strong dense-prediction performance that reuses existing pre-training, and it is reported to extend to hierarchical transformers as well.

Abstract

Although Vision Transformer (ViT) has achieved significant success in computer vision, it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems, which introduce additional pre-training costs. Therefore, we present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive field convolutional features into the ViT architecture, which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which is beneficial for handling dense prediction tasks. (3) We evaluate the performance of ViT-CoMer across various dense prediction tasks, different frameworks, and multiple advanced pre-training. Notably, our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and 62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks to facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer.

Paper Structure (16 sections, 4 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: Object detection performance on COCO val2017 using Mask R-CNN. With advanced pre-trained ViT weights, our ViT-CoMer outperforms the other methods. "$\dagger$" marks models that use advanced pre-trained weights; all other models use ImageNet-1K pre-training.
  • Figure 2: Different backbone paradigms for dense predictions. (a) The plain backbone paradigm can leverage open-source advanced pre-trained weights (e.g., the BEiT series [bao2021beit, peng2022beit, beit3] and DINOv2 [oquab2023dinov2]). Its drawback is the limited scale diversity of its feature representation, which is insufficient for dense predictions. (b) The vision-specific backbone paradigm designs a multi-scale feature framework that handles dense predictions effectively. However, each structural modification requires retraining the pre-trained weights from scratch on large-scale image datasets. (c) The adapted backbone paradigm combines the advantages of CNNs and transformers. It can directly load advanced pre-trained weights and fuse multi-scale convolutional features with transformer features, which benefits dense predictions.
  • Figure 3: The overall architecture of ViT-CoMer. ViT-CoMer is a two-branch architecture consisting of three components: (a) a plain ViT with $L$ layers, evenly divided into $N$ stages for feature interaction; (b) a CNN branch that employs the proposed Multi-Receptive Field Feature Pyramid (MRFP) module to provide multi-scale spatial features; and (c) a simple and efficient CNN-Transformer Bidirectional Fusion Interaction (CTI) module that integrates the features of the two branches at different stages, enhancing semantic information.
  • Figure 4: Multi-Receptive Field Feature Pyramid module. The $C_{3}$, $C_{4}$, and $C_{5}$ features are first reduced in dimension by a linear projection layer. They are then divided into multiple groups along the channel dimension, and each group applies a depthwise convolution (DWConv) with a different kernel size to enrich the receptive-field representation; this step is the multi-receptive field convolution (MRC) operation. Finally, a dimensional expansion restores the features to their original dimensions.
  • Figure 5: CNN-Transformer Bidirectional Fusion Interaction module. {$F_{3}$, $F_{4}$, $F_{5}$} are the multi-scale CNN features obtained through the MRFP module. $F_{4}$ is added to the feature $X$ from the ViT branch, and a multi-scale self-attention module then unifies the features of the two modalities, achieving information interaction and producing the updated features.
  • ...and 1 more figures
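
The MRFP steps described in the Figure 4 caption (channel reduction, grouping along the channel dimension, a depthwise convolution with a different kernel size per group, channel expansion) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pointwise_project`, `depthwise_box_conv`, `multi_receptive_conv`, `mrfp_sketch`) are hypothetical, the depthwise kernels are fixed averaging filters standing in for learned weights, and the kernel sizes are placeholders, since the captions do not state them.

```python
from typing import List, Sequence

FeatureMap = List[List[float]]  # one channel: an H x W grid of activations


def pointwise_project(channels: Sequence[FeatureMap],
                      weight: Sequence[Sequence[float]]) -> List[FeatureMap]:
    """Linear projection over channels at every pixel.

    weight has shape (out_channels, in_channels);
    out[o][i][j] = sum_c weight[o][c] * channels[c][i][j].
    """
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [[sum(wc * ch[i][j] for wc, ch in zip(row_w, channels))
          for j in range(w)]
         for i in range(h)]
        for row_w in weight
    ]


def depthwise_box_conv(channel: FeatureMap, k: int) -> FeatureMap:
    """Filter one channel with a k x k averaging kernel (zero padding, same size).

    Assumption: a fixed box filter replaces the learned depthwise kernel.
    """
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    h, w = len(channel), len(channel[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # zero padding outside the map
                        total += channel[y][x]
            out[i][j] = total / (k * k)
    return out


def multi_receptive_conv(channels: Sequence[FeatureMap],
                         kernel_sizes: Sequence[int]) -> List[FeatureMap]:
    """Split channels into equal groups; group g uses kernel_sizes[g]."""
    groups = len(kernel_sizes)
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    per_group = len(channels) // groups
    out: List[FeatureMap] = []
    for g, k in enumerate(kernel_sizes):
        for ch in channels[g * per_group:(g + 1) * per_group]:
            out.append(depthwise_box_conv(ch, k))
    return out


def mrfp_sketch(channels: Sequence[FeatureMap],
                w_reduce: Sequence[Sequence[float]],
                w_expand: Sequence[Sequence[float]],
                kernel_sizes: Sequence[int]) -> List[FeatureMap]:
    """Reduce channels, apply per-group depthwise filtering, expand back."""
    reduced = pointwise_project(channels, w_reduce)
    mixed = multi_receptive_conv(reduced, kernel_sizes)
    return pointwise_project(mixed, w_expand)
```

Calling `mrfp_sketch` with a four-channel input, a 2 x 4 reduction matrix, a 4 x 2 expansion matrix, and `kernel_sizes=[1, 3]` returns four channels of the original spatial size. Half of the reduced channels pass through a 1 x 1 filter and the other half through a 3 x 3 filter, so the output mixes two receptive fields in one feature map. The same routine would be applied to each pyramid level ($C_{3}$, $C_{4}$, $C_{5}$) separately.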