Hierarchical Multi-Scale Attention for Semantic Segmentation

Andrew Tao, Karan Sapra, Bryan Catanzaro

TL;DR

This work tackles semantic segmentation by addressing the trade-off between inference scales through a hierarchical multi-scale attention mechanism. The method learns relative attention between adjacent scales, enabling a memory-efficient, chainable fusion of predictions across multiple resolutions and allowing flexible inference with unseen scales. An auto-labelling strategy with hard labels boosts Cityscapes generalization, contributing to state-of-the-art results on Cityscapes (85.1 IOU) and Mapillary Vistas (61.1 IOU) without prohibitive training cost. Overall, the approach improves fine-detail and global-context predictions while reducing training memory, making multi-scale inference more practical for large-scale datasets.

Abstract

Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network and then the results are combined with averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes, and that the network learns to favor those scales for such cases in order to generate better predictions. Our attention mechanism is hierarchical, which enables it to be roughly 4x more memory efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes, which leads to greater model accuracy. We demonstrate the results of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach, we achieve new state-of-the-art results on both Mapillary (61.1 IOU val) and Cityscapes (85.1 IOU test).
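The averaging and max-pooling baselines mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the per-scale class scores have already been resampled to a common resolution; the function names `combine_average` and `combine_max` and the random inputs are hypothetical and are not taken from the paper.

```python
import numpy as np


def combine_average(preds):
    """Average per-class scores over scales.

    preds: list of arrays of shape (C, H, W), one per inference scale,
    assumed to be already resampled to the same H x W.
    """
    return np.mean(np.stack(preds, axis=0), axis=0)


def combine_max(preds):
    """Take the element-wise maximum of per-class scores over scales."""
    return np.max(np.stack(preds, axis=0), axis=0)


# Hypothetical inputs: 3 scales, 3 classes, a 4x4 image.
rng = np.random.default_rng(0)
preds = [rng.random((3, 4, 4)) for _ in range(3)]

avg = combine_average(preds)
mx = combine_max(preds)

# Per-pixel class labels after fusion.
labels_avg = avg.argmax(axis=0)
labels_max = mx.argmax(axis=0)
```

Both baselines apply the same rule at every pixel, regardless of image content. The attention-based approach replaces that fixed rule with learned, per-pixel weights.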

Paper Structure

This paper contains 13 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of common failure modes for semantic segmentation as they relate to inference scale. In the first row, the thin posts are inconsistently segmented in the scaled-down (0.5x) image, but better predicted in the scaled-up (2.0x) image. In the second row, the large road / divider region is better segmented at lower resolution (0.5x).
  • Figure 2: Network architecture. The left and right panels show the explicit and hierarchical (ours) architectures, respectively. Left: the architecture from chen2015attention, where the attention for each scale is learned explicitly. Right: our hierarchical attention architecture. Right top: an illustration of our training pipeline, whereby the network learns to predict attention between adjacent scale pairs. Right bottom: inference is performed in a chained/hierarchical manner in order to combine multiple scales of predictions. Lower-scale attention determines the contribution of the next higher scale.
  • Figure 3: Semantic and attention predictions at every scale level for two different scenes. The scene on the left illustrates a fine-detail problem, while the scene on the right illustrates a large-region segmentation problem. A white color for attention indicates a high value (close to 1.0). The attention values for a given pixel across all scales sum to 1.0. Left: The thin road-side posts are best resolved at 2x scale, and the attention attends more to that scale than to the other scales, as evidenced by the white color for the posts in the 2x attention image. Right: The large road/divider region is best predicted at 0.5x scale, and the attention focuses most heavily on the 0.5x scale for that region.
  • Figure 4: Example of our auto-generated coarse image labels. Auto-generated coarse labels (right) provide finer detail of labelling than the original ground truth coarse labels (middle). This finer labelling improves the distribution of the labels since both small and large items are now represented, as opposed to primarily large items.
  • Figure 5: Qualitative Results. From left to right: input, ground truth, our method on Cityscapes.
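
The chained inference described in the Figure 2 caption, together with the Figure 3 statement that per-pixel attention sums to 1.0 across scales, can be sketched as follows. This is one reading of the captions and not the authors' implementation. It assumes that all predictions and attention maps have already been resampled to a common resolution, that each attention map lies in [0, 1], and that the attention at a lower scale weights that scale against the fused result of all higher scales. The names `chained_fusion` and `effective_weights` are hypothetical.

```python
import numpy as np


def chained_fusion(preds, attns):
    """Fuse per-scale predictions with a chain of pairwise attention maps.

    preds: list of (C, H, W) arrays ordered from lowest to highest scale.
    attns: list of (H, W) arrays in [0, 1], one per scale except the highest.
    """
    fused = preds[-1]  # start from the highest scale
    # Walk down the scales: each lower scale's attention decides how much
    # of that scale to keep versus the result fused so far.
    for pred, attn in zip(reversed(preds[:-1]), reversed(attns)):
        fused = attn * pred + (1.0 - attn) * fused
    return fused


def effective_weights(attns):
    """Per-scale weights implied by the chain, ordered lowest to highest."""
    weights = []
    remaining = np.ones_like(attns[0])
    for attn in attns:
        weights.append(remaining * attn)
        remaining = remaining * (1.0 - attn)
    weights.append(remaining)  # whatever is left goes to the highest scale
    return weights


# Hypothetical inputs: scales 0.5x, 1.0x, 2.0x; 3 classes; a 4x4 image.
rng = np.random.default_rng(0)
preds = [rng.random((3, 4, 4)) for _ in range(3)]
attns = [rng.random((4, 4)) for _ in range(2)]

fused = chained_fusion(preds, attns)
weights = effective_weights(attns)
weight_sum = sum(weights)
```

Under these assumptions the implied per-scale weights sum to 1.0 at every pixel, which matches the behavior described for Figure 3. Adding another scale at inference only extends the chain by one more pairwise step.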