FANet: Feature Amplification Network for Semantic Segmentation in Cluttered Background

Muhammad Ali, Mamoona Javaid, Mubashir Noman, Mustansar Fiaz, Salman Khan

TL;DR

Semantic segmentation in cluttered scenes with translucent objects is difficult for conventional CNNs, and even for some transformer-based approaches, because it requires both local detail and global context. The authors introduce FANet, a backbone that constructs multi-stage features and uses an Adaptive Feature Enhancement (AFE) block that runs two modules in parallel: a Spatial Context Module (SCM) for large-context information and a Feature Refinement Module (FRM) for semantic refinement. FRM draws on image sharpening and contrast enhancement principles to preserve high-frequency details while capturing low-frequency blob semantics, and SCM expands the receptive field with a large kernel. On the ZeroWaste-f dataset, FANet achieves an mIoU of $54.89\%$ and a Pixel Accuracy of $91.41\%$, outperforming competitive baselines such as DeepLabv3+ and FocalNet-B and demonstrating robust performance in highly cluttered scenes with translucent objects.
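
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how mIoU and Pixel Accuracy are conventionally derived from a confusion matrix. The summary does not state the exact evaluation protocol used for ZeroWaste-f, so the function name `segmentation_metrics`, the handling of absent classes, and the toy matrix are all assumptions for illustration only.

```python
def segmentation_metrics(confusion):
    """Return (mIoU, pixel accuracy) from a square confusion matrix.

    confusion[i][j] = number of pixels whose true class is i and
    whose predicted class is j. Hypothetical helper, not from the paper.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true other
        union = tp + fp + fn
        if union > 0:  # assumption: skip classes absent from both maps
            ious.append(tp / union)

    pixel_acc = correct / total
    miou = sum(ious) / len(ious)
    return miou, pixel_acc


# Toy two-class example (made-up counts, not results from the paper).
confusion = [[50, 10],
             [5, 35]]
miou, acc = segmentation_metrics(confusion)
print(f"mIoU={miou:.4f}, pixel accuracy={acc:.4f}")
```

Pixel Accuracy is dominated by frequent classes such as background, while mIoU averages over classes, which may explain why the two reported numbers differ so widely.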

Abstract

Existing deep learning approaches neglect semantic cues that are crucial for semantic segmentation in complex scenarios, such as cluttered backgrounds and translucent objects. To handle these challenges, we propose a feature amplification network (FANet) as a backbone network that incorporates semantic information using a novel feature enhancement module at multiple stages. To achieve this, we propose an adaptive feature enhancement (AFE) block that benefits from both a spatial context module (SCM) and a feature refinement module (FRM) in a parallel fashion. SCM leverages a larger kernel to increase the receptive field and handle scale variations in the scene, whereas our novel FRM generates semantic cues that capture both low-frequency and high-frequency regions for better segmentation. We perform experiments on the challenging real-world ZeroWaste-f dataset, which contains background-cluttered and translucent objects. Our experimental results demonstrate state-of-the-art performance compared to existing methods.

Paper Structure

This paper contains 13 sections, 3 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: The main challenges in semantic segmentation, e.g., translucent objects, background clutter, and scale variations. The top row shows the input images, while the bottom row shows the input images overlaid with the ground truth.
  • Figure 2: Given an image $I$ (a) and its smoothed version $I_b$ (b), element-wise multiplication of $I$ and $I_b$ emphasizes the color information and preserves the blob regions (c), whereas subtracting the smoothed image $I_b$ from the original image $I$ highlights the fine details (d). Motivated by this, we introduce our feature refinement module (FRM).
  • Figure 3: (a) Overall illustration of our proposed feature amplification network (FANet), a backbone network for background-cluttered semantic segmentation. The input is passed to the FANet backbone to produce multi-stage features ($S1$, $S2$, $S3$, and $S4$), which are fed to a UperNet decoder (Xiao et al., 2018) as the segmentation head for prediction. (b) Our novel adaptive feature enhancement (AFE) block, designed to capture rich information. It comprises convolutional embeddings (CE), a spatial context module (SCM), a feature refinement module (FRM), and a ConvMLP. The AFE block adaptively aggregates the large-kernel information from SCM, which increases the receptive field, with the output of FRM, which refines the features in the spatial dimension.
  • Figure 4: Illustration of our novel feature amplification module. The input features $F$ are downsampled using depthwise convolution (DWConv) and then upsampled to obtain the $Q$ features. The input features $F$ are subtracted from the $Q$ features to obtain the $R$ features, which highlight the fine details. Similarly, the input features are multiplied with the $Q$ features to obtain the $S$ features, which highlight the low-frequency components in the spatial dimension. These low-frequency and high-frequency features are then aggregated after DWConv to obtain enhanced features. Finally, the aggregated features are passed to the projection layer to obtain the final $\bar{F}$ features.
  • Figure 5: Qualitative comparison on ZeroWaste-f dataset. Our method better segments the objects from cluttered backgrounds compared to existing state-of-the-art methods.
  • ...and 1 more figure
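
The image-level intuition in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustration of the motivation, not the FRM itself: a clamped-edge box blur stands in for the unspecified smoothing that produces $I_b$, the image is a single-channel nested list, and the names `box_blur` and `decompose` are hypothetical. Per the Figure 4 caption, the actual module applies analogous operations to learned features using DWConv-based downsampling and upsampling.

```python
def box_blur(img, radius=1):
    """Mean filter with edge clamping; an assumed stand-in for the smoothing that yields I_b."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index to the image
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index to the image
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def decompose(img, radius=1):
    """Return (low, high) where low = I * I_b and high = I - I_b, element-wise."""
    blurred = box_blur(img, radius)
    low = [[a * b for a, b in zip(row, brow)] for row, brow in zip(img, blurred)]
    high = [[a - b for a, b in zip(row, brow)] for row, brow in zip(img, blurred)]
    return low, high


# A 5x5 image with a single bright pixel on a dark background.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
low, high = decompose(image)
print("detail response at the bright pixel:", high[2][2])
print("detail response in a flat corner:", high[0][0])
```

The difference term responds strongly where a pixel deviates from its neighbourhood and vanishes in flat regions, while the product term stays large only where both the pixel and its smoothed surroundings are bright, which matches the fine-detail and blob behaviour the caption describes.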