Expectation-Maximization Attention Networks for Semantic Segmentation

Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, Hong Liu

TL;DR

The paper tackles the high computational cost of self-attention for semantic segmentation by reframing attention as an Expectation-Maximization (EM) process. It introduces Expectation-Maximization Attention (EMA), which iteratively estimates a compact set of bases and reconstructs the features from them as a low-rank representation. This reduces complexity from $O(N^2)$ to $O(NKT)$, where $N$ is the number of positions, $K$ the number of bases, and $T$ the number of EM iterations, and it improves robustness to input variance. The EMA Unit (EMAU) integrates this mechanism into CNNs, with bases maintenance and normalization to stabilize training, and achieves state-of-the-art results on PASCAL VOC, PASCAL Context, and COCO Stuff with favorable efficiency. The work shows that EM-driven attention can capture salient semantics through interpretable, semantically meaningful bases, offering practical gains for large-scale semantic segmentation.
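
To make the $O(N^2)$ versus $O(NKT)$ comparison concrete, the following is a rough back-of-the-envelope sketch. The feature-map size and the value of $K$ are hypothetical numbers chosen only for illustration; $T = 3$ matches the training setting quoted in the figure captions. The counts ignore the channel dimension and constant factors.

```python
# Hypothetical sizes, chosen only for illustration (not results from the paper).
n = 65 * 65   # N: number of spatial positions, assuming a 65x65 feature map
k = 64        # K: number of bases (assumed value)
t = 3         # T: number of EM iterations

full_attention_pairs = n * n   # position-position pairs, grows as O(N^2)
ema_pairs = n * k * t          # position-basis pairs over T iterations, O(NKT)

print(full_attention_pairs)
print(ema_pairs)
print(full_attention_pairs / ema_pairs)
```

Under these assumed sizes the pairwise count drops by roughly a factor of 22, and the gap widens as $N$ grows, because the EMA term is linear in $N$.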

Abstract

The self-attention mechanism has been widely used for various tasks. It computes the representation of each position as a weighted sum of the features at all positions, and can therefore capture long-range relations in computer vision tasks. However, it is computationally expensive, since the attention maps are computed with respect to all other positions. In this paper, we formulate the attention mechanism in an expectation-maximization manner and iteratively estimate a much more compact set of bases upon which the attention maps are computed. Because the output is a weighted summation over these bases, the resulting representation is low-rank and suppresses noisy information from the input. The proposed Expectation-Maximization Attention (EMA) module is robust to the variance of the input and is also friendly in memory and computation. Moreover, we introduce bases maintenance and normalization methods to stabilize its training procedure. We conduct extensive experiments on popular semantic segmentation benchmarks, including PASCAL VOC, PASCAL Context, and COCO Stuff, on which we set new records.
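
The iterative scheme described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `em_attention`, the softmax over inner products in the E-step, the random initialization, and the L2 normalization of the bases are all choices made here for the sketch.

```python
import numpy as np

def em_attention(x, k=8, t=3, seed=0):
    """Sketch of EM-style attention on features x of shape (N, C).

    Returns the low-rank reconstruction (N, C), the responsibilities
    z (N, K), and the bases mu (K, C). All details are assumptions.
    """
    rng = np.random.default_rng(seed)
    n, c = x.shape
    # Assumed initialization: random bases, L2-normalized per basis.
    mu = rng.standard_normal((k, c))
    mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    for _ in range(t):
        # E-step: responsibilities of each basis for each position,
        # computed as a numerically stable softmax over the K bases.
        logits = x @ mu.T                              # (N, K)
        logits -= logits.max(axis=1, keepdims=True)
        z = np.exp(logits)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: each basis becomes the responsibility-weighted
        # mean of the features, then is re-normalized (assumed).
        mu = (z.T @ x) / (z.sum(axis=0)[:, None] + 1e-6)
        mu /= np.linalg.norm(mu, axis=1, keepdims=True) + 1e-6
    # Reconstruction: weighted sum of the bases, so rank is at most K.
    x_hat = z @ mu
    return x_hat, z, mu

# Usage on random features (illustrative data only).
features = np.random.default_rng(1).standard_normal((100, 16))
x_hat, z, mu = em_attention(features, k=4, t=3)
```

Each iteration touches only an $N \times K$ responsibility matrix rather than an $N \times N$ attention map, and the output `x_hat` is a product of an $N \times K$ and a $K \times C$ matrix, which is where the low-rank property comes from.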

Paper Structure

This paper contains 25 sections, 16 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Pipeline of the proposed expectation-maximization attention method.
  • Figure 2: Overall structure of the proposed EMAU. The key component is the EMA operator, in which $A_{\mathrm{E}}$ and $A_{\mathrm{M}}$ execute alternately. In addition to the EMA operator, we add two $1 \times 1$ convolutions at the beginning and the end of EMA and sum the output with original input, to form a residual-like block. Best viewed on screen.
  • Figure 3: Ablation study on strategy of bases maintenance (left) and normalization (right) of EMAU. Experiments are carried out upon ResNet-50 with batch size $12$ and training output stride $16$ on the PASCAL VOC dataset. The iteration number $T$ for training is set as $3$. Best viewed on screen.
  • Figure 4: Ablation study on the iteration number $T$. Experiments are conducted upon ResNet-50 with training output stride $16$ and batch size $12$ on the PASCAL VOC dataset.
  • Figure 5: Visualization of responsibilities $\mathbf{Z}$ at the last iteration. The first two rows illustrate two examples from the PASCAL VOC validation set. The last two rows illustrate two examples from the PASCAL Context validation set. $z_{\cdot i}$ represents the responsibilities of the $i$-th basis to all pixels in the last iteration; $i, j, k$, and $l$ are four randomly selected indices, where $1 \leq i, j, k, l \leq K$. Best viewed on screen.