Weakly-Supervised Semantic Segmentation of Circular-Scan, Synthetic-Aperture-Sonar Imagery

Isaac J. Sledge; Dominic M. Byrne; Jonathan L. King; Steven H. Ostertag; Denton L. Woods; James L. Prater; Jermaine L. Kennedy; Timothy M. Marston; Jose C. Principe

TL;DR

This work proposes a weakly-supervised framework for the semantic segmentation of circular-scan synthetic-aperture-sonar (CSAS) imagery and shows that this framework performs comparably to nine fully-supervised deep networks and outperforms eleven of the best weakly-supervised deep networks.

Abstract

We propose a weakly-supervised framework for the semantic segmentation of circular-scan synthetic-aperture-sonar (CSAS) imagery. The first part of our framework is trained in a supervised manner, on image-level labels, to uncover a set of semi-sparse, spatially-discriminative regions in each image. The classification uncertainty of each region is then evaluated. Those areas with the lowest uncertainties are then chosen to be weakly labeled segmentation seeds, at the pixel level, for the second part of the framework. Each seed extent is progressively resized according to an unsupervised, information-theoretic loss with structured-prediction regularizers. This reshaping process uses multi-scale, adaptively-weighted features to delineate class-specific transitions in local image content. Content-addressable memories are inserted at various parts of our framework so that it can leverage features from previously seen images to improve segmentation performance for related images. We evaluate our weakly-supervised framework using real-world CSAS imagery that contains over ten seafloor classes and ten target classes. We show that our framework performs comparably to nine fully-supervised deep networks. Our framework also outperforms eleven of the best weakly-supervised deep networks. We achieve state-of-the-art performance when pre-training on natural imagery. The average absolute performance gap to the next-best weakly-supervised network is well over ten percent for both natural imagery and sonar imagery. This gap is found to be statistically significant.
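The seed-selection step described in the abstract (evaluate the classification uncertainty of each region, then keep the least uncertain regions as seeds) can be illustrated with a minimal sketch. It assumes Shannon entropy as the uncertainty measure and a fixed keep fraction; the abstract specifies neither, and the names `entropy`, `select_seeds`, and `keep_fraction` are hypothetical, not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_seeds(region_probs, keep_fraction=0.25):
    """Return (region_id, label) pairs for the lowest-entropy regions.

    region_probs: dict mapping a region id to its class-probability list.
    keep_fraction: assumed hyperparameter, the fraction of regions kept.
    """
    # Rank regions from most certain (low entropy) to least certain.
    ranked = sorted(region_probs, key=lambda r: entropy(region_probs[r]))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    seeds = []
    for region in ranked[:n_keep]:
        probs = region_probs[region]
        # The seed label is the most probable class for that region.
        label = max(range(len(probs)), key=probs.__getitem__)
        seeds.append((region, label))
    return seeds


# Toy class distributions for four candidate regions (illustrative values).
regions = {
    "r0": [0.96, 0.02, 0.02],
    "r1": [0.40, 0.35, 0.25],
    "r2": [0.05, 0.90, 0.05],
    "r3": [0.34, 0.33, 0.33],
}
seeds = select_seeds(regions, keep_fraction=0.5)
print(seeds)
```

With these toy values, the two confidently classified regions are kept as seeds and the two ambiguous regions are left unlabeled, to be resolved by the later region-expansion stage.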

Paper Structure (17 sections, 4 equations, 27 figures)

Figures (27)

Figure 3.1: A summary of the major steps in our framework. Steps 1 and 2 are addressed by the network in Figure 3.2. Steps 3, 4, 5, and 6 are addressed by the networks in Figure 3.4 and Figure 3.5. Step 7 is addressed by the network in \ref{fig:sfn-network}. Note that bathymetric details are shown in the above CSAS image. Depth cues are hence available for this scene. We do not utilize such cues in the current study to help with segmentation. This is because interferometry data was only captured for a small subset of the scenes in the dataset that we use. However, bathymetry can be leveraged by our framework without any design changes.
Figure 3.2: An overview of our supervised network for assessing class-activation mappings. (a) A network diagram of our class-activation network (CAN) and the resulting class-activation outputs for a CSAS image containing a variety of seafloor types. The encoder branch of our network (CAN-E) relies on a series of multi-scale convolution banks that extract local-global spectral features. These banks enable our network to discern high-quality semantic features with both small and large receptive fields despite using only a few filters. Universal, recurrent memory layers are inserted within the branch to store and recall multi-context details to further improve the feature quality in an efficient way. All of these features are aggregated in the deepest part of the encoder before being transformed into a probabilistic, open-set classification response in the classifier branch (CAN-C). Lift-CAM is then used to infer a class-activation mapping from the CAN classification response. For this diagram, spectral convolutional layers (SConv) are denoted via light blue blocks. Darker blue bands on these blocks are used to signify that rectified-linear-unit activations are applied. Green blocks correspond to the content-addressable, universal-recurrent memory cells (UREM). Spectral average pooling layers (SAPool) are denoted using red blocks. The fully-connected, openmax aggregation layer (FC) is denoted using a gray block. (b) A tabular summary of the major network layers of the CAN. For each layer, we list its name, its numerical order in the network, the kernel size, the stride, either the number of channels or the number of elements, and the index of the layer that feeds into it. We recommend that readers consult the electronic version of this paper to see the full image details.
Figure 3.3: Examples of assessing feature importance as a pre-processor for weakly-supervised segmentation. (a) A bathymetric CSAS image of an underwater scene that contains a plastic barrel. A non-color-by-aspect encoding is used here. (b) A Grad-CAM-inferred class-activation mapping from our CAN [ZhouB-conf2016a, SelvarajuRR-conf2017a]. (c) A Lift-CAM-inferred class-activation mapping from our CAN [JungH-conf2021a]. The former would provide poor seed cues for semantically segmenting this scene. The latter, however, would not. This is largely because Lift-CAM scores quantify how the expected model performance changes when conditioning on a particular feature. Such scores weight features based on their importance and often emphasize class-specific regions in the samples well. In this scene, they highlight a majority of the plastic barrel. They also identify the flat sand and indented sand regions well.
Figure 3.4: An overview of our unsupervised network for semantic segmentation. (a) A network diagram of our convolutional region-expansion network (CREN) and the resulting segmentation outputs for a CSAS image containing a crashed helicopter. The encoder branches of our network (CREN-E) rely on a series of multi-scale convolution banks that extract local-global spectral features. These features are obtained both for the input image and from the class-activation mappings produced by the CAN in Figure 3.2. Dual inputs are used to provide both high- and low-level cues so that the network can extract good semantic features that aid in classification. Using only a single input, such as the class-activation mapping, can impede segmentation, since local image content is obscured by the pixel affinities. Universal, recurrent memory layers are inserted within the branch to store and recall multi-context details to further improve the feature quality. The features are aggregated in the deepest part of the encoders. We apply an adaptive, large-margin regularization to re-organize the features. This helps mitigate segmentation errors. An initial segmentation map is formed in the decoder stage (CREN-D) and is progressively upsampled and refined before an openmax activation is applied to yield a probabilistic classification response. Superpixels, generated from our unsupervised superpixel network (USN) in Figure 3.5, are then employed to enforce consistent spatial labeling. The block coloring scheme used in Figure 3.2 is reused in this diagram. We use red blocks followed by green blocks to denote upsampling by spectral, transposed local-global convolution layers (STConv). (b) A tabular summary of the major network layers of the CREN. We recommend that readers consult the electronic version of this paper to see the full image details.
Figure 3.5: An overview of our unsupervised network for generating superpixels. (a) A network diagram of our unsupervised superpixel network (USN) and the resulting segmentation outputs for a CSAS image containing a crashed helicopter. The encoder branch of our network (USN-E) relies on dual multi-scale convolution banks that extract local-global spectral features from the CSAS image. These features are aggregated and then upscaled to approximately reconstruct the CSAS image. This yields an incredibly compact network that eschews feature redundancy, which facilitates robust superpixel generation. Seed cues are selected from the embedded features in the decoder stage (USN-C). Non-iterative region merging occurs to yield a spatial grouping of pixels. The block coloring scheme used in Figure 3.2 is reused in this diagram. Yellow-colored blocks denote non-iterative clustering layers for either generating the initial superpixels or assigning pixel affinity. (b) A tabular summary of the major network layers of the USN. We recommend that readers consult the electronic version of this paper to see the full image details.
...and 22 more figures
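
The caption of Figure 3.4 states that superpixels from the USN are employed to enforce consistent spatial labeling. One simple way such a constraint could be realized is a majority vote within each superpixel. The sketch below illustrates that idea only; the captions do not say that the framework uses majority voting, and the function name `smooth_labels` and the toy arrays are hypothetical.

```python
from collections import Counter


def smooth_labels(labels, superpixels):
    """Assign every pixel the most common label within its superpixel.

    labels: 2-D list of per-pixel class ids (e.g., an argmax of the
        segmentation response).
    superpixels: 2-D list of per-pixel superpixel ids, same shape.
    Ties are broken in favor of the label encountered first.
    """
    # Count label votes inside each superpixel.
    votes = {}
    for label_row, sp_row in zip(labels, superpixels):
        for label, sp in zip(label_row, sp_row):
            votes.setdefault(sp, Counter())[label] += 1
    # Pick the winning label per superpixel, then paint it back.
    winner = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [[winner[sp] for sp in sp_row] for sp_row in superpixels]


# Toy 3x4 label map with two stray pixels, and a two-superpixel partition.
labels = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
superpixels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
smoothed = smooth_labels(labels, superpixels)
for row in smoothed:
    print(row)
```

In this toy case the two isolated mislabeled pixels are overwritten by the dominant label of their superpixel, so each superpixel ends up with a single class.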
