Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation

Lian Xu, Mohammed Bennamoun, Farid Boussaid, Wanli Ouyang, Ferdous Sohel, Dan Xu

TL;DR

A cross-task dual-affinity learning module to learn both pairwise and unary affinities, which are used to enhance the task-specific features and predictions by aggregating both query-dependent and query-independent global context for both saliency detection and semantic segmentation.

Abstract

Most existing weakly supervised semantic segmentation (WSSS) methods rely on Class Activation Mapping (CAM) to extract coarse class-specific localization maps using image-level labels. Prior works have commonly used an off-line heuristic thresholding process that combines the CAM maps with off-the-shelf saliency maps produced by a general pre-trained saliency model to produce more accurate pseudo-segmentation labels. We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from these saliency maps and the significant inter-task correlation between saliency detection and semantic segmentation. In the proposed AuxSegNet+, saliency detection and multi-label image classification are used as auxiliary tasks to improve the primary task of semantic segmentation with only image-level ground-truth labels. We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps. In particular, we propose a cross-task dual-affinity learning module to learn both pairwise and unary affinities, which are used to enhance the task-specific features and predictions by aggregating both query-dependent and query-independent global context for both saliency detection and semantic segmentation. The learned cross-task pairwise affinity can also be used to refine and propagate CAM maps to provide better pseudo labels for both tasks. Iterative improvement of segmentation performance is enabled by cross-task affinity learning and pseudo-label updating. Extensive experiments demonstrate the effectiveness of the proposed approach with new state-of-the-art WSSS results on the challenging PASCAL VOC and MS COCO benchmarks.
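
The off-line heuristic that the abstract attributes to prior works can be illustrated with a minimal sketch. The exact rule differs between methods and is not given here, so the function name `heuristic_pseudo_labels`, the thresholds `sal_thresh` and `cam_thresh`, and the `ignore_index` value are all assumptions made for illustration: non-salient pixels become background, salient pixels take the strongest CAM class among the image-level labels, and salient pixels without a confident CAM response are marked as ignored.

```python
import numpy as np

def heuristic_pseudo_labels(cam, saliency, image_labels,
                            sal_thresh=0.5, cam_thresh=0.2, ignore_index=255):
    """Combine CAM maps with a saliency map into pseudo-segmentation labels.

    cam:          (C, H, W) class activation maps scaled to [0, 1]
    saliency:     (H, W) saliency map scaled to [0, 1]
    image_labels: (C,) binary vector of image-level class labels
    Returns an (H, W) integer map: 0 = background, c + 1 = foreground class c.
    All thresholds are hypothetical values, not taken from the paper.
    """
    # Suppress activations of classes that are absent from the image.
    cam = cam * image_labels[:, None, None]
    best_class = cam.argmax(axis=0)
    best_score = cam.max(axis=0)

    labels = np.full(saliency.shape, ignore_index, dtype=np.int64)
    salient = saliency >= sal_thresh
    # Non-salient pixels are treated as background.
    labels[~salient] = 0
    # Salient pixels with a confident CAM response take that class.
    confident = salient & (best_score >= cam_thresh)
    labels[confident] = best_class[confident] + 1
    return labels
```

Salient pixels with weak CAM responses stay at `ignore_index`, so a segmentation loss can skip them rather than learn from an unreliable label.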

Paper Structure (14 sections, 14 equations, 13 figures, 12 tables, 1 algorithm)

Figures (13)

  • Figure 1: An illustration of the proposed approach for weakly supervised semantic segmentation. Our approach jointly learns two auxiliary tasks (i.e., multi-label image classification and saliency detection) and a primary task (i.e., semantic segmentation) using only image-level ground-truth labels, and performs affinity learning across the two dense prediction tasks (i.e., saliency detection and semantic segmentation). The learned affinity is then used to generate updated pseudo ground-truth (PGT) labels, which supervise both saliency detection and semantic segmentation.
  • Figure 2: An overview of the proposed AuxSegNet+. An input RGB image (a) is first passed through a backbone network to extract image features, which are then fed to three branches for multi-label image classification (b), saliency detection (c & f), and semantic segmentation (e & g), respectively. The proposed cross-task affinity learning module (see Fig. 3) takes the segmentation and saliency feature maps as inputs, and outputs enhanced feature maps for predicting both tasks (c & e) together with the cross-task affinity maps (d), comprising a unary and a pairwise affinity map, for task-specific prediction refinement (f & g). The refined saliency predictions are used to update the pseudo saliency labels, and the learned cross-task pairwise affinity map is used to refine the CAM maps (h) to update the pseudo segmentation labels (i) for retraining the network. Network training (black solid lines) and label updating (blue dashed lines) are performed alternately for multiple stages (i.e., $s=1,2,\ldots,S$) to learn more reliable affinity maps and produce more accurate segmentation predictions.
  • Figure 3: The structure of the proposed cross-task affinity learning module. (a) The cross-task pairwise affinity learning module [xu2021leveraging] first generates two task-specific pairwise affinity maps by applying the self-attention mechanism to the saliency and segmentation feature maps, respectively. The task-specific pairwise affinity maps are used to refine the corresponding input feature maps and are further fused to produce a cross-task affinity map. (b) The proposed cross-task dual-affinity learning module uses a dual-attention mechanism (detailed in Fig. 4) to generate a unary affinity map and a pairwise affinity map for each of the saliency and segmentation feature maps. The task-specific dual affinity maps are used to refine the corresponding feature maps, and are further fused to produce a cross-task unary affinity map and a cross-task pairwise affinity map.
  • Figure 4: Detailed structure of the feature refinement module and the dual-attention mechanism, which captures both the pairwise and unary affinities. $\mathbf{W}_v$, $\mathbf{W}_k$, $\mathbf{W}_q$, and $\mathbf{W}_u$ denote the weight matrices of four $1\times1$ convolutions, respectively.
  • Figure 5: Qualitative segmentation results on the PASCAL VOC val set. (a) Input images. (b) Ground-truth segmentation masks. (c) Segmentation masks predicted by EPS [lee2021railroad]. (d) Segmentation masks predicted by AuxSegNet. (e) Segmentation masks predicted by AuxSegNet+.
  • ...and 8 more figures
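
The dual-attention mechanism and the affinity-based CAM refinement described in the captions of Figures 2 to 4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each $1\times1$ convolution is written as a matrix product on a flattened feature map, the residual sum used to combine the two contexts, the averaging in the hypothetical helper `fuse_cross_task`, and the propagation rule in the hypothetical helper `refine_cam` are all guesses at reasonable choices, since the captions do not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(x, W_q, W_k, W_v, W_u):
    """Dual attention on a flattened feature map x of shape (N, C), N = H * W.

    W_q, W_k: (C, D); W_v: (C, C); W_u: (C, 1). Each matrix plays the role of
    one 1x1 convolution. Returns enhanced features (N, C), the pairwise
    affinity (N, N), and the unary affinity (N,).
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # Pairwise affinity: row i holds the affinity of query i to every position.
    pairwise = softmax(q @ k.T, axis=1)
    # Unary affinity: one weight per position, shared by all queries.
    unary = softmax(x @ W_u, axis=0)
    context_pairwise = pairwise @ v   # (N, C), query-dependent global context
    context_unary = unary.T @ v       # (1, C), query-independent global context
    # Assumption: both contexts are added to the input as residuals.
    enhanced = x + context_pairwise + context_unary
    return enhanced, pairwise, unary[:, 0]

def fuse_cross_task(aff_saliency, aff_segmentation):
    """Fuse two task-specific affinity maps (assumption: simple average)."""
    return 0.5 * (aff_saliency + aff_segmentation)

def refine_cam(cam, pairwise, n_iters=1):
    """Propagate CAM scores (K, N) along a row-stochastic affinity (N, N).

    After one step, position i holds the affinity-weighted average of the
    scores at all positions j. The rule and n_iters are assumptions.
    """
    for _ in range(n_iters):
        cam = cam @ pairwise.T
    return cam
```

Because each row of the pairwise affinity sums to one, averaging two such maps yields another row-stochastic map, so repeated propagation in `refine_cam` keeps the CAM scores within their original range.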