OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing

Pranav Gupta, Rishubh Singh, Pradeep Shenoy, Ravikiran Sarvadevabhatla

TL;DR

This work proposes a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization, and introduces an encoder module termed LDF to provide low-level dense feature guidance.

Abstract

Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of $\mathbf{3.3}$ (Pascal-Parts-58), $\mathbf{3.5}$ (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is $\mathbf{4.0}$. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets. The code is available at olafseg.github.io
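The abstract does not spell out how the weight adaptation works, so the following is only a rough sketch of one generic way to let an RGB pre-trained first convolution accept a 5-channel input: the two new input channels get zero-initialised weights, so the layer's initial response matches the pre-trained one. This is not necessarily the authors' technique. The helper names `expand_first_conv` and `conv_response_at_patch` are hypothetical, and the weight layout `(out, in, kh, kw)` is assumed.

```python
import numpy as np


def expand_first_conv(w_rgb, extra_channels=2):
    """Extend (out, 3, kh, kw) conv weights to (out, 3 + extra, kh, kw).

    Assumption for illustration: the extra input channels are
    zero-initialised, so the pre-trained RGB filters are left untouched.
    """
    out_ch, _, kh, kw = w_rgb.shape
    w_extra = np.zeros((out_ch, extra_channels, kh, kw), dtype=w_rgb.dtype)
    return np.concatenate([w_rgb, w_extra], axis=1)


def conv_response_at_patch(weights, patch):
    """Response of every filter to a single (in_ch, kh, kw) input patch."""
    return np.einsum("oikl,ikl->o", weights, patch)


rng = np.random.default_rng(0)
w_rgb = rng.normal(size=(8, 3, 3, 3))   # stand-in for pre-trained RGB filters
w_5ch = expand_first_conv(w_rgb)        # now accepts RGB + fg + edge
patch = rng.normal(size=(5, 3, 3))      # one 5-channel input patch

# With zero-initialised extra channels, the initial response is unchanged.
same = np.allclose(
    conv_response_at_patch(w_rgb, patch[:3]),
    conv_response_at_patch(w_5ch, patch),
)
```

Zero initialisation keeps the first optimization step identical to the RGB model, which is one plausible reading of "stable"; other initialisations (for example, copying the mean RGB filter) are equally possible.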

Paper Structure

This paper contains 20 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The recipe for OLAF, our plug-and-play framework for enhanced multi-object multi-part scene parsing: (1) augment the RGB input with object-based channels (fg/bg, boundary edges) obtained from frozen pre-trained models ($M_O, M_E$); (2) use Low-level Dense Feature guidance from the segmentation encoder (LDF, shaded green); (3) employ targeted weight adaptation for stable optimization with the augmented input. We show that following this recipe leads to significant gains (up to $\mathbf{4.0}$ mIoU) across multiple architectures and multiple challenging datasets.
  • Figure 2: Segmentation results of the state-of-the-art approach FLOAT \cite{float} (second column) illustrate its limitations. In the first row, FLOAT completely fails to identify TV Frame and TV Screen. In the second row, FLOAT fails to capture the boundaries between Car-Body and Car-Tire, and between Car-Body and Car-Window. The third column shows results from incorporating our plug-and-play approach OLAF into FLOAT, which significantly improves object and part segmentation.
  • Figure 3: Illustration of OLAF's architectural integration with FLOAT \cite{float} (Sec. \ref{sec:methodology_of_olaf}). FLOAT's components are tagged with $\bigstar$. The object masks from the output $S_o$ of the object segmentation network $\mathcal{M}_o$ are merged to obtain the foreground map $fg$. The output of the edge generation network $\mathcal{M}_e$ is thresholded and filtered using $fg$ to obtain the edge map $edge$. The obtained maps are stacked with the input image $I$ to obtain the $5$-channel input $I'$ for the part segmentation network $\mathcal{F}$. The interface for LDF (Sec. \ref{sec:lldfe}) with encoder $E_{part}$ and its architecture (top right) are also shown. A similar integration of OLAF also exists for U-Net style and Transformer style architectures.
  • Figure 4: LDF (Sec. \ref{sec:lldfe}) consistently improves segmentation of small/thin parts. As shown in Row I, FLOAT \cite{float} with LDF successfully predicts Aeroplane-Body, which FLOAT alone fails to segment adequately. Similarly, in Row II, LDF successfully predicts Car-Light, which FLOAT completely misses.
  • Figure 5: Qualitative comparison on Pascal-Part-201. OLAF consistently improves the performance of previous methods (BSANet \cite{bsanet}, FLOAT \cite{float}). The improvement is most visible for small parts (Row 1: eyes, ears, nose and right-front-leg; Row 2: eye, ears, nose, mouth and tail) and for occluded parts (Row 3: parts of the motorbike).
  • ...and 2 more figures
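
The input construction described in the Figure 3 caption (merge object masks into $fg$, threshold the edge output and filter it with $fg$, stack both with $I$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_augmented_input`, the array layouts, the threshold value of 0.5, and the reading of "filtered using $fg$" as an element-wise mask are all assumptions.

```python
import numpy as np


def build_augmented_input(image, object_masks, edge_prob, edge_threshold=0.5):
    """Stack an RGB image with foreground and edge channels.

    image:        (H, W, 3) float array, the input I.
    object_masks: (K, H, W) bool array, one mask per predicted object
                  (stand-in for the output S_o of the object network).
    edge_prob:    (H, W) float array in [0, 1]
                  (stand-in for the edge network output).
    edge_threshold is an assumed value; the caption does not give one.
    """
    # Merge the per-object masks into a single foreground map fg.
    fg = object_masks.any(axis=0)
    # Threshold the edge response and keep only edges inside the foreground.
    edge = (edge_prob >= edge_threshold) & fg
    # Stack RGB + fg + edge into the 5-channel input I'.
    extra = np.stack([fg, edge], axis=-1).astype(image.dtype)
    return np.concatenate([image, extra], axis=-1)


# Toy example: two square objects on a 4x4 image, strong edges everywhere.
image = np.zeros((4, 4, 3), dtype=np.float32)
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True
masks[1, 2:, 2:] = True
edge_prob = np.full((4, 4), 0.9)

augmented = build_augmented_input(image, masks, edge_prob)
```

In this toy case the edge channel coincides with the foreground channel, because every pixel passes the threshold and the foreground filter removes the rest.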