Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective

Qishuai Wen, Chun-Guang Li

TL;DR

The paper addresses the lack of theoretical grounding for Transformer decoders in semantic segmentation by framing decoding as a PCA-based compression problem. It derives DEPICT, a white-box, fully attentional decoder that uses self-attention to refine image embeddings into an ideal principal subspace and cross-attention to obtain a low-rank, class-aligned representation, with dot-product decoding producing masks. Through unrolled gradient steps and a grouped MSSA/MSCA formulation, DEPICT provides a principled alternative to black-box decoders and demonstrates superior or competitive performance with far fewer parameters on ADE20K, Cityscapes, and Pascal Context. The work contributes a concrete interpretability framework linking compression theory to vision transformers and highlights orthogonality and robustness properties as key advantages for principled semantic segmentation.

Abstract

State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the interpretations as follows: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross-attention operator seeks to find a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot-product operation yields a compact representation of the image embeddings as segmentation masks. Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black-box counterpart, Segmenter, and that it is lightweight and more robust.
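The three operations listed in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's DEPICT implementation: it uses plain single-head softmax attention with no learned projections, random inputs, and hypothetical names (`self_attention`, `cross_attention`, `decode_masks`, `Z0`, `Q0`). It is only meant to show how the data flows from image embeddings to per-patch class labels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z):
    """Refine image embeddings Z of shape (N, d) by attending over all patches.

    Assumption: generic single-head attention with a residual connection,
    standing in for the paper's self-attention operator.
    """
    d = Z.shape[1]
    A = softmax(Z @ Z.T / np.sqrt(d), axis=-1)  # (N, N) attention weights
    return Z + A @ Z

def cross_attention(Q, Z):
    """Update class embeddings Q of shape (K, d) from image embeddings Z (N, d)."""
    d = Z.shape[1]
    A = softmax(Q @ Z.T / np.sqrt(d), axis=-1)  # (K, N) attention weights
    return Q + A @ Z

def decode_masks(Z, Q):
    """Dot-product decoding: project each patch onto each class embedding."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    logits = Zn @ Qn.T                          # (N, K) cosine similarities
    return logits.argmax(axis=1), logits

rng = np.random.default_rng(0)
N, K, d = 64, 5, 16                  # patches, classes, embedding dimension
Z0 = rng.standard_normal((N, d))     # stand-in for backbone image embeddings
Q0 = rng.standard_normal((K, d))     # stand-in for initial class embeddings

Z = self_attention(Z0)               # 1) refine image embeddings
Q = cross_attention(Q0, Z)           # 2) extract class embeddings
labels, logits = decode_masks(Z, Q)  # 3) dot product yields mask logits
```

Reshaping `labels` to the patch grid (here it would be 8 by 8) gives a coarse segmentation map; a real decoder would upsample the logits to image resolution.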

Paper Structure

This paper contains 24 sections, 46 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Illustration for Segmenter and MaskFormer. a) Segmenter. b) MaskFormer. c) Transformer block adopted by Segmenter. We omit the details of the Transformer decoder adopted by MaskFormer, which refines image embeddings and mask embeddings via self-attention respectively before the cross-attention operations.
  • Figure 2: Image Segmentation via PCA and DEPICT. Given an image, we segment it via PCA and our DEPICT. We perform PCA on its representations $\boldsymbol{Z}_0$ and $\boldsymbol{Z}_{L_1}$, respectively, where the first 10 principal directions are used as cluster centroids. We find that PCA can serve as an effective method for image segmentation, especially on refined features such as $\boldsymbol{Z}_{L_1}$. We also observe that performing PCA on $\boldsymbol{Z}_0$ is more likely to lead to over-segmentation, which indicates that its principal subspace is not ideal.
  • Figure 3: Illustration for DEPICT. Given an image for semantic segmentation, we represent it as $\boldsymbol{Z}_0$ by the ViT backbone. Segmenting it by performing PCA on $\boldsymbol{Z}_0$, we find that the principal subspace $\mathcal{S}$ of $\boldsymbol{Z}_0$ is not ideal. We thus adopt the MSSA operator to refine the image embeddings, iteratively constructing an ideal $\mathcal{S}$. Performing PCA again on $\boldsymbol{Z}_{L_1}$, we find that the segmentation results are improved. Then, we adopt the MSCA operator to find a low-rank approximation of $\boldsymbol{Z}_{L_1}$ that lies in $\mathcal{S}$ and serves as classifiers. For example, we use the dogs and cats on the right to represent image embeddings of two different classes in the feature space. Initially, the projections of dogs and cats onto $\mathcal{S}$ are not well linearly separable. DEPICT, however, constructs an ideal $\mathcal{S}$ and effectively classifies them.
  • Figure 4: Investigating orthogonality in DEPICT. Left: $\boldsymbol{P}^\top \boldsymbol{P}$; Right: $\boldsymbol{Q}^\top \boldsymbol{Q}$. All variants are based on ViT-L. Since the MHSA operator contains three parameter matrices, unlike MSSA, which has only one, we visualize the matrix responsible for transforming queries. Notably, all the $\boldsymbol{Q}$'s are normalized, whereas $\boldsymbol{P}$ is not.
  • Figure 5: Inner product of class embeddings across images. We group the class embeddings by their classes and visualize the inner products among them. As an example, we show 30 classes across 100 images. All variants are based on ViT-L.
  • ...and 5 more figures
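
The PCA-based segmentation described for Figure 2 and the Gram-matrix check visualized in Figure 4 can be sketched as follows. This is an illustrative reading, not the authors' code: the function names and the random input `Z` are hypothetical, and the assignment rule (each patch goes to the principal direction with the largest absolute projection) is an assumption, since the caption only says that the first 10 principal directions act as cluster centroids.

```python
import numpy as np

def principal_directions(Z, k):
    """Top-k principal directions of Z (N, d), returned as columns of a (d, k) matrix."""
    Zc = Z - Z.mean(axis=0, keepdims=True)            # center the embeddings
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)  # rows of Vt are directions
    return Vt[:k].T

def pca_segment(Z, k=10):
    """Assign each patch embedding to one of the top-k principal directions.

    Assumption: absolute projection is used because the sign of a
    principal direction returned by SVD is arbitrary.
    """
    U = principal_directions(Z, k)
    Zc = Z - Z.mean(axis=0, keepdims=True)
    scores = np.abs(Zc @ U)                            # (N, k) projection magnitudes
    return scores.argmax(axis=1), U

rng = np.random.default_rng(0)
Z = rng.standard_normal((196, 32))   # stand-in for a 14x14 grid of patch embeddings
labels, U = pca_segment(Z, k=10)

# Gram matrix of the directions, analogous to the P^T P plots in Figure 4.
# For exact PCA directions it equals the identity up to floating-point error.
gram = U.T @ U
off_diagonal = gram - np.diag(np.diag(gram))
max_off_diagonal = np.abs(off_diagonal).max()
```

For learned matrices such as $\boldsymbol{P}$, the same Gram-matrix computation shows how close the columns are to orthonormal: a near-identity result indicates near-orthonormal columns, while large off-diagonal entries indicate correlated directions.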