Scale Equalization for Multi-Level Feature Fusion

Bum Jun Kim, Sang Woo Kim

TL;DR

This work identifies scale disequilibrium as a key issue in multi-level feature fusion for semantic segmentation. The disequilibrium arises from bilinear upsampling, which reduces feature variance and biases gradient scales at initialization. The paper proposes Scale Equalizers, simple post-upsampling normalizers that use the dataset-wide mean $\mu$ and standard deviation $\sigma$ to give zero-mean, unit-variance inputs before fusion; in effect, they act as a cost-free initialization for the fusion weights. The approach is theoretically motivated by gradient-scale considerations and BN properties, and empirically validated across multiple backbones (e.g., Swin, Twins, ConvNeXt) and datasets (ADE20K, PASCAL VOC 2012, Cityscapes), improving mIoU by about $+0.1$ to $+0.4$ on average. The method is lightweight, hyperparameter-free, and easily integrated into existing decoders such as UPerHead, PSPHead, ASPPHead, and SepASPPHead, which makes it practical for multi-level feature fusion across diverse segmentation tasks.

Abstract

Deep neural networks have exhibited remarkable performance in a variety of computer vision fields, especially in semantic segmentation tasks. Their success is often attributed to multi-level feature fusion, which enables them to understand both global and local information from an image. However, we found that multi-level features from parallel branches are on different scales. The scale disequilibrium is a universal and unwanted flaw that leads to detrimental gradient descent, thereby degrading performance in semantic segmentation. We discover that scale disequilibrium is caused by bilinear upsampling, which is supported by both theoretical and empirical evidence. Based on this observation, we propose injecting scale equalizers to achieve scale equilibrium across multi-level features after bilinear upsampling. Our proposed scale equalizers are easy to implement, applicable to any architecture, hyperparameter-free, implementable without requiring extra computational cost, and guarantee scale equilibrium for any dataset. Experiments showed that adopting scale equalizers consistently improved the mIoU index across various target datasets, including ADE20K, PASCAL VOC 2012, and Cityscapes, as well as various decoder choices, including UPerHead, PSPHead, ASPPHead, SepASPPHead, and FCNHead.
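
The two claims above (upsampling lowers variance, and a post-upsampling normalizer restores it) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a 1-D signal of i.i.d. Gaussian values, uses 2x linear interpolation as a stand-in for bilinear upsampling, and the helper names `upsample_linear_2x` and `scale_equalize` are hypothetical.

```python
import random
import statistics


def upsample_linear_2x(signal):
    """Upsample a 1-D signal by 2x using linear interpolation.

    Assumption: a 1-D stand-in for bilinear upsampling, not the paper's setup.
    """
    out = []
    for i in range(len(signal) - 1):
        out.append(signal[i])                           # original sample
        out.append(0.5 * (signal[i] + signal[i + 1]))   # interpolated midpoint
    out.append(signal[-1])
    return out


def scale_equalize(values, mean, std):
    """Hypothetical scale equalizer: shift and rescale with fixed statistics."""
    return [(v - mean) / std for v in values]


random.seed(0)
feature = [random.gauss(0.0, 1.0) for _ in range(5000)]  # toy feature values
upsampled = upsample_linear_2x(feature)

var_before = statistics.pvariance(feature)
var_after = statistics.pvariance(upsampled)   # smaller: midpoints average two samples

# Statistics measured once over the (toy) dataset, then reused as constants.
mu = statistics.fmean(upsampled)
sigma = statistics.pstdev(upsampled)
equalized = scale_equalize(upsampled, mu, sigma)
var_eq = statistics.pvariance(equalized)      # back to unit variance

print(f"variance before upsampling: {var_before:.3f}")
print(f"variance after upsampling:  {var_after:.3f}")
print(f"variance after equalizing:  {var_eq:.3f}")
```

In this toy setting every interpolated midpoint averages two independent samples and so has half their variance, which is why the upsampled signal ends up at roughly three quarters of the original variance. Real feature maps are spatially correlated, so the actual reduction depends on the network and the upsampling ratio.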

Paper Structure (29 sections, 2 theorems, 7 equations, 3 figures, 2 tables, 1 algorithm)

Key Result

Proposition 3.1

Consider a multi-level feature fusion, where a concatenated feature $[x_1; x_2]$ is subjected to a linear layer with weight $[w_1, w_2]$ and bias $b$ to yield the fused feature $y=w_1 x_1 + w_2 x_2 + b$. When the two features $x_1$ and $x_2$ are on different scales, i.e., $\mathop{\mathrm{Var}}\nolimits$ …
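
Since the statement is cut off here, the following is only an illustration of why feature scale matters for the fusion layer defined above, not a reproduction of the proposition. For $y=w_1 x_1 + w_2 x_2 + b$, the chain rule gives $\partial L/\partial w_i = (\partial L/\partial y)\, x_i$, so the gradient for each weight scales with its own input. The sketch assumes a squared-error loss, Gaussian toy inputs, and arbitrary initial weights; all of these are illustrative choices.

```python
import random

random.seed(0)
n = 10000
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]   # feature with unit std
x2 = [random.gauss(0.0, 0.1) for _ in range(n)]   # feature with 10x smaller std
targets = [random.gauss(0.0, 1.0) for _ in range(n)]
w1, w2, b = 0.5, 0.5, 0.0                         # arbitrary initial weights


def rms(values):
    """Root mean square, used here as a measure of gradient scale."""
    return (sum(v * v for v in values) / len(values)) ** 0.5


grads_w1, grads_w2 = [], []
for a, c, t in zip(x1, x2, targets):
    y = w1 * a + w2 * c + b
    dL_dy = y - t                 # derivative of 0.5 * (y - t)**2 (assumed loss)
    grads_w1.append(dL_dy * a)    # dL/dw1 = dL/dy * x1
    grads_w2.append(dL_dy * c)    # dL/dw2 = dL/dy * x2

ratio = rms(grads_w1) / rms(grads_w2)
print(f"RMS gradient on w1 / RMS gradient on w2 = {ratio:.1f}")
```

With the same learning rate for both weights, the weight attached to the smaller-scale feature receives proportionally smaller updates.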

Figures (3)

  • Figure 1: Visualization of the architecture of modern decoders: (a) UPerHead, (b) PSPHead, (c) ASPPHead and SepASPPHead, and (d) their general form.
  • Figure 2: Overview of the problem statement and the proposed solution. For simplicity, this illustration depicts UPerHead fusing two features, although the common fusion scheme uses four. (Top) Existing multi-level feature fusion concatenates features after bilinear upsampling. The variances of the concatenated features, represented as chroma in this figure, are in disequilibrium because bilinear upsampling decreases variance. In this fusion, $\mathbf{P}_1$ dominates the fused feature (shown in red), which diminishes the contribution of $\mathbf{P}_2$ and slows training of $w_2$. (Middle) Our proposed multi-level feature fusion with scale equalizers guarantees consistent variance across the concatenated features. In this scheme, a properly fused feature (shown in purple) is produced, with non-vanishing gradients with respect to both $w_1$ and $w_2$. (Bottom) Efficient implementation of our proposed method, where the scale equalizers are replaced by an auxiliary initialization of $w_1$ and $w_2$.
  • Figure 3: Empirical observation on decreased variance after bilinear upsampling. The black dotted line $(\pi-1)/2\pi$ corresponds to the case when the output of a convolutional unit block is subjected to bilinear upsampling.
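
The bottom panel of Figure 2 describes replacing the equalizers with an auxiliary initialization. One way to see why this can be free of extra cost is the algebraic identity for a linear fusion layer: $w_1 \frac{x_1-\mu_1}{\sigma_1} + w_2 \frac{x_2-\mu_2}{\sigma_2} + b = \frac{w_1}{\sigma_1} x_1 + \frac{w_2}{\sigma_2} x_2 + \left(b - \frac{w_1 \mu_1}{\sigma_1} - \frac{w_2 \mu_2}{\sigma_2}\right)$. The sketch below checks this identity numerically for scalar weights. The helper `fold_equalizers` and all numeric values are hypothetical, and the paper's actual procedure may differ in detail.

```python
import random


def fuse(x1, x2, w1, w2, b):
    """Linear fusion y = w1*x1 + w2*x2 + b for scalar features."""
    return w1 * x1 + w2 * x2 + b


def fold_equalizers(weights, bias, mus, sigmas):
    """Absorb per-feature (x - mu) / sigma into the fusion weights and bias.

    Hypothetical helper: returns (folded_weights, folded_bias).
    """
    folded_w = [w / s for w, s in zip(weights, sigmas)]
    folded_b = bias - sum(w * m / s for w, m, s in zip(weights, mus, sigmas))
    return folded_w, folded_b


random.seed(0)
mus, sigmas = [0.3, -1.2], [1.0, 0.4]   # assumed dataset-wide statistics
weights, bias = [0.7, -0.2], 0.1        # assumed initial fusion weights

folded_w, folded_b = fold_equalizers(weights, bias, mus, sigmas)

max_diff = 0.0
for _ in range(1000):
    x1 = random.gauss(mus[0], sigmas[0])
    x2 = random.gauss(mus[1], sigmas[1])
    # Explicit equalizers followed by the original fusion layer.
    explicit = fuse((x1 - mus[0]) / sigmas[0], (x2 - mus[1]) / sigmas[1],
                    weights[0], weights[1], bias)
    # Raw features fed to the fusion layer with folded parameters.
    folded = fuse(x1, x2, folded_w[0], folded_w[1], folded_b)
    max_diff = max(max_diff, abs(explicit - folded))

print(f"largest output difference: {max_diff:.2e}")
```

Because the two forms produce the same output, the rescaling can be applied once to the initial weights instead of as an extra layer at every forward pass.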

Theorems & Definitions (2)

  • Proposition 3.1
  • Theorem 3.2