Representation Separation for Semantic Segmentation with Vision Transformers

Yuanduo Hong, Huihui Pan, Weichao Sun, Xinghu Yu, Huijun Gao

TL;DR

We present an efficient framework for representation separation at the local-patch and global-region levels for semantic segmentation with ViTs; the improved representations transfer favorably to images with natural corruptions.

Abstract

Vision transformers (ViTs), which encode an image as a sequence of patches, bring new paradigms for semantic segmentation. We present an efficient framework for representation separation at the local-patch level and the global-region level for semantic segmentation with ViTs. It targets the peculiar over-smoothness of ViTs in semantic segmentation, and therefore differs from the currently popular paradigm of context modeling and from most existing related methods, which reinforce the advantage of attention. We first introduce the decoupled two-pathway network, in which an additional pathway enhances and passes down local-patch discrepancy complementary to the global representations of transformers. We then propose the spatially adaptive separation module, which yields more separated deep representations, and the discriminative cross-attention, which yields more discriminative region representations through novel auxiliary supervisions. The proposed methods achieve the following results: 1) combined with large-scale plain ViTs, they achieve new state-of-the-art performance on five widely used benchmarks; 2) using masked pre-trained plain ViTs, we achieve 68.9% mIoU on Pascal Context, setting a new record; 3) pyramid ViTs integrated with the decoupled two-pathway network even surpass well-designed high-resolution ViTs on Cityscapes; 4) the representations improved by our framework transfer favorably to images with natural corruptions. The code will be released publicly.
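The results above are reported in mIoU (mean intersection-over-union). For readers unfamiliar with the metric, the following is a minimal sketch of how mIoU is commonly computed over flattened label maps. It is not the authors' evaluation code: the function name `mean_iou`, the `ignore_index` value of 255, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = 255) -> float:
    """Mean IoU over flattened label maps (illustrative sketch).

    Pixels whose ground-truth label equals `ignore_index` are skipped.
    Classes that appear in neither prediction nor ground truth are
    left out of the average (an assumption; conventions differ).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_area = [0] * num_classes
    target_area = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        pred_area[p] += 1
        target_area[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes, last pixel ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. Benchmark toolkits typically accumulate the intersection and union counts over the whole dataset before dividing, rather than averaging per-image scores.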

Paper Structure (36 sections, 17 equations, 37 figures, 15 tables)

Figures (37)

  • Figure 1: Comparison between our representation separation for ViTs and context modeling for CNNs. The representations learned by CNNs are usually dispersed because of their limited receptive fields. Context modeling modules learn more discriminative representations by reducing the intra-class variances. Compared to CNNs, the learned representations of ViTs are more coherent and can even be too close together to distinguish different categories. The proposed methods separate the representations of different categories while maintaining the small intra-class variances.
  • Figure 2: Comparisons with previous state-of-the-art methods on five benchmarks. All results are obtained with ViT-L [steiner2021train] and multi-scale testing.
  • Figure 4: Overview of our methods with plain ViTs and pyramid ViTs. The blue blocks denote LSBs and the yellow blocks denote pre-trained transformer layers. A dotted box indicates a module that is dropped during inference.
  • Figure 5: The details of spatially adaptive separation module.
  • Figure 6: The details of discriminative cross-attention.
  • ...and 32 more figures
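
The Figure 1 caption contrasts intra-class variance with the distance between categories. As a rough illustration of those two quantities, the sketch below computes the mean squared distance of features to their class centroid and the mean pairwise distance between class centroids. This is a toy diagnostic under stated assumptions, not a measure defined in the paper: the function name `class_statistics` and the choice of Euclidean distance are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import dist
from typing import Hashable, Sequence, Tuple


def class_statistics(features: Sequence[Sequence[float]],
                     labels: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (intra_class_variance, inter_class_distance).

    intra: mean squared Euclidean distance of each feature to the
           centroid of its own class.
    inter: mean Euclidean distance between all pairs of class centroids.
    Both definitions are illustrative assumptions.
    """
    if not features or len(features) != len(labels):
        raise ValueError("features and labels must be non-empty and aligned")

    # Group feature vectors by class label.
    groups = defaultdict(list)
    for f, y in zip(features, labels):
        groups[y].append(f)

    # Per-class centroid: coordinate-wise mean.
    centroids = {
        y: [sum(col) / len(fs) for col in zip(*fs)]
        for y, fs in groups.items()
    }

    intra = sum(dist(f, centroids[y]) ** 2
                for f, y in zip(features, labels)) / len(features)

    pairs = list(combinations(centroids.values(), 2))
    inter = sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return intra, inter


# Two tight clusters that are far apart: small intra, large inter.
feats = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labs = [0, 0, 1, 1]
intra, inter = class_statistics(feats, labs)
print(intra, inter)
```

Read against the caption, CNN features without context modeling would correspond to a large intra-class value, over-smoothed ViT features to a small inter-class value, and the stated goal is to increase the latter while keeping the former small.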