AttEntropy: On the Generalization Ability of Supervised Semantic Segmentation Transformers to New Objects in New Domains

Krzysztof Lis, Matthias Rottmann, Annika Mütze, Sina Honari, Pascal Fua, Mathieu Salzmann

TL;DR

The paper tackles the problem of segmenting unseen objects in new domains using vision transformers trained for semantic segmentation. It introduces AttEntropy, a method that converts intermediate spatial attention maps into entropy heatmaps via the Shannon entropy $E^l(Z^{l-1})_j = - \sum_{j'} \bar{A}^{l}(Z^{l-1})_{j,j'} \log \bar{A}^{l}(Z^{l-1})_{j,j'}$, computed for each image patch $j$ over its outgoing attention to all patches $j'$. Low-entropy regions in these heatmaps make it possible to segment never-seen-before categories. The authors validate AttEntropy across multiple backbones and datasets, including Cityscapes-based models and broader domains like Lunar, Maritime, and Aircraft, and demonstrate robustness under no, partial, and complete domain shifts. The results show that a training-free entropy-based cue can approach, and in some cases compete with, training-based obstacle detection methods while incurring negligible overhead, with automatic layer selection further enhancing performance. The work suggests a practical pathway for open-world segmentation and pre-segmentation in robotics and autonomous driving, enabling rapid adaptation to new object categories without additional training.
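To make the entropy formula concrete, here is a minimal NumPy sketch of a per-patch attention-entropy heatmap. It is an illustration under stated assumptions, not the authors' implementation: the function name `attention_entropy`, the `(heads, N, N)` input layout, the reading of $\bar{A}^{l}$ as the head-averaged attention, and the `eps` stabilizer are all hypothetical choices made for this example.

```python
import numpy as np


def attention_entropy(attn, grid_hw, eps=1e-12):
    """Per-patch Shannon entropy of outgoing attention (illustrative sketch).

    attn:    array of shape (heads, N, N); row j holds the attention that
             patch j assigns to every patch j' (each row sums to 1).
    grid_hw: (H, W) patch grid with H * W == N.
    Returns an (H, W) entropy heatmap.
    """
    attn = np.asarray(attn, dtype=np.float64)
    # Assumption: the bar in the formula denotes an average over heads.
    a_bar = attn.mean(axis=0)
    # Renormalize rows so each one is a valid probability distribution.
    a_bar = a_bar / a_bar.sum(axis=1, keepdims=True)
    # E_j = -sum_{j'} A[j, j'] * log A[j, j']; eps guards against log(0).
    entropy = -(a_bar * np.log(a_bar + eps)).sum(axis=1)
    h, w = grid_hw
    return entropy.reshape(h, w)


# Usage with random softmax attention on a 3 x 5 patch grid.
rng = np.random.default_rng(0)
heads, h, w = 4, 3, 5
n = h * w
logits = rng.normal(size=(heads, n, n))
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)
heatmap = attention_entropy(attn, (h, w))
print(heatmap.shape)  # (3, 5)
```

Two limiting cases help when reading such a heatmap: a patch that attends uniformly to all $N$ patches has the maximal entropy $\log N$, while a patch whose attention concentrates on a single patch has entropy close to zero.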

Abstract

In addition to impressive performance, vision transformers have demonstrated remarkable abilities to encode information they were not trained to extract. For example, this information can be used to perform segmentation or single-view depth estimation even though the networks were only trained for image recognition. We show that a similar phenomenon occurs when explicitly training transformers for semantic segmentation in a supervised manner for a set of categories: Once trained, they provide valuable information even about categories absent from the training set. This information can be used to segment objects from these never-seen-before classes in domains as varied as road obstacles, aircraft parked at a terminal, lunar rocks, and maritime hazards.

Paper Structure (21 sections, 10 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Segmenting objects from categories the network was not trained for. The attention entropy extracted from a SETR [Zheng21] transformer trained on the urban driving Cityscapes [Cordts16] dataset lets us segment objects from four previously unseen categories. In all four cases, we show an original image (left) and the corresponding attention entropy (right).
  • Figure 2: Entropy heatmaps. For each image patch, we compute the Shannon entropy of outgoing attentions. We show the spatial attention at two different image locations. The right image shows the Shannon entropy. Small objects receive concentrated attention and thus low corresponding entropy. An interactive tool for attention visualization is included in the supplementary material (and will be publicly available).
  • Figure 3: Qualitative results on obstacle detection in traffic scenes. The left two images are from LostAndFound [Pinggera16] and the right three images are from RoadObstacle21 [Chan21b], including ones with difficult weather and limited light. The middle and bottom rows show the averaged entropy of SETR and Segformer respectively, both using manual layer averaging. The heatmap is overlaid in the evaluation ROI. Slight rectangular artifacts arise from MMSegmentation's sliding window inference.
  • Figure 4: Qualitative results on obstacle detection on different datasets. The contour of the ground truth obstacle areas is highlighted. Compared to the other training-free obstacle detection methods, our attention entropy generalizes better to the distant domains.
  • Figure 5: Qualitative results on the Aircraft [AirbusAircraftSegmentation] dataset.
  • ...and 8 more figures