OCNet: Object Context Network for Scene Parsing

Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, Jingdong Wang

TL;DR

OCNet introduces object-context pooling to explicitly emphasize same-object pixel relations in semantic segmentation. It offers two implementations: a dense relation computed via self-attention and an efficient interlaced sparse self-attention (ISA). It extends both with Pyramid-OC and ASP-OC to capture multi-scale object context. It demonstrates competitive or state-of-the-art performance on Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff, while reducing computational cost relative to standard self-attention approaches. The approach also proves beneficial when integrated into Mask R-CNN, indicating broad applicability to dense prediction tasks.

Abstract

In this paper, we address the semantic segmentation task with a new context aggregation scheme named \emph{object context}, which focuses on enhancing the role of object information. Motivated by the fact that the category of each pixel is inherited from the object it belongs to, we define the object context for each pixel as the set of pixels that belong to the same category as the given pixel in the image. We use a binary relation matrix to represent the relationship between all pixels, where the value one indicates that the two selected pixels belong to the same category and zero otherwise. We propose to use a dense relation matrix as a surrogate for the binary relation matrix. The dense relation matrix is capable of emphasizing the contribution of object information, as the relation scores tend to be larger on the object pixels than on the other pixels. Considering that estimating the dense relation matrix requires quadratic computation overhead and memory consumption w.r.t. the input size, we propose an efficient interlaced sparse self-attention scheme that models the dense relations between any two pixels via the combination of two sparse relation matrices. To capture richer context information, we further combine our interlaced sparse self-attention scheme with conventional multi-scale context schemes, including pyramid pooling~\citep{zhao2017pyramid} and atrous spatial pyramid pooling~\citep{chen2018deeplab}. We empirically show the advantages of our approach with competitive performance on five challenging benchmarks: Cityscapes, ADE20K, LIP, PASCAL-Context and COCO-Stuff.
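The relation matrices described in the abstract can be illustrated with a minimal sketch on a toy, flattened image of six pixels. Everything here is an assumption made for illustration: the function names, the hand-picked feature vectors, the use of plain dot products followed by a row-wise softmax (the learned query/key transforms of a real self-attention layer are omitted), and the weighted average used as the pooling step.

```python
import math

def binary_relation_matrix(labels):
    """R[i][j] is 1 when pixels i and j share a category, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]

def dense_relation_matrix(features):
    """Row-wise softmax over pairwise dot products (learned transforms omitted)."""
    rows = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def aggregate(relation, features):
    """Weighted average of all pixel features, one row of weights per pixel."""
    dim = len(features[0])
    pooled = []
    for row in relation:
        norm = sum(row)
        pooled.append([sum(w * f[d] for w, f in zip(row, features)) / norm
                       for d in range(dim)])
    return pooled

# Hypothetical toy data: six pixels with hand-picked 3-d features.
labels = ["road", "road", "car", "car", "person", "road"]
features = [
    [2.0, 0.0, 0.0],
    [1.8, 0.2, 0.0],
    [0.0, 2.0, 0.0],
    [0.1, 1.9, 0.0],
    [0.0, 0.0, 2.0],
    [2.0, 0.1, 0.0],
]

R = binary_relation_matrix(labels)   # ground-truth object context
W = dense_relation_matrix(features)  # dense surrogate
context = aggregate(R, features)     # object-context features per pixel
```

On this toy input, every row of `W` assigns larger scores to same-category pixels than to the others, which is the behaviour the abstract attributes to the dense relation matrix. Both matrices have one entry per pixel pair, which is the quadratic cost the interlaced sparse scheme is meant to avoid.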

Paper Structure

This paper contains 25 sections, 25 equations, 15 figures, 12 tables, 2 algorithms.

Figures (15)

  • Figure 1: Illustrating the predicted dense relation matrices. The first column shows example images sampled from the Cityscapes val set, in which we mark three pixels on the objects car, person and road with ✙ respectively. The second column shows the ground truth segmentation maps. The third column shows the dense relation matrices (or approximated object context maps) of the three pixels. The relation values of pixels belonging to the same category as the selected pixel tend to be larger.
  • Figure 2: Illustrating the Interlaced Sparse Self-Attention. Our approach consists of a global relation module and a local relation module. The left-most and right-most feature maps are the input and output, respectively. First, we color the input feature map $\mathbf{X}$ with four different colors. There are $4$ local groups, and each group consists of four different colors. For the global relation module, we permute $\mathbf{X}$ so that all positions with the same color, which are separated by long spatial intervals, are grouped together, which outputs $\mathbf{X}^{g}$. Then we divide $\mathbf{X}^{g}$ into $4$ groups and apply self-attention to each group independently. We merge the updated feature maps of all groups as the output $\mathbf{Z}^{g}$. For the local relation module, we permute $\mathbf{Z}^{g}$ to group the originally nearby positions together and get $\mathbf{X}^{l}$. Then we divide and apply self-attention in the same manner as in the global relation module, obtaining the final feature map $\mathbf{Z}^{l}$. Combining the global relation module and the local relation module propagates information from all input positions to each output position. Positions drawn with the same color saturation have unchanged feature values; we increase the saturation of the color only when the feature maps are updated by a self-attention operation.
  • Figure 3: FLOPs vs. input size. The $x$-axis represents the height or width of the input feature map (we assume the height is equal to the width for convenience) and the $y$-axis represents the computation cost measured in GFLOPs. The GFLOPs of the self-attention (SA) mechanism increase much faster than those of our interlaced sparse self-attention (ISA) mechanism as the input resolution grows.
  • Figure 4: GPU memory/FLOPs/running time comparison between SA and ISA. All numbers are measured on a single Titan XP GPU with CUDA $8.0$ and an input feature map of $1\times 512\times 128\times 128$ during the inference stage. Lower is better for all metrics. The proposed interlaced sparse self-attention (ISA) uses only $10.2\%$ of the GPU memory and $24.6\%$ of the FLOPs of self-attention (SA) while being nearly $2\times$ faster.
  • Figure 5: Interlaced Sparse Self-Attention.
  • ...and 10 more figures
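
The grouping described in the Figure 2 caption can be sketched on a 1-D sequence of positions. This is a simplified illustration, not the paper's implementation: the helper names are hypothetical, the feature map is flattened to 16 positions with a block size of 4 (assumed to divide the length evenly), and the attention itself is replaced by a boolean matrix that only records which positions can exchange information.

```python
def local_groups(n, block):
    """Groups of originally nearby positions: [0..block-1], [block..2*block-1], ..."""
    return [list(range(start, start + block)) for start in range(0, n, block)]

def global_groups(n, block):
    """Groups of same-offset ("same color") positions spaced `block` apart."""
    return [list(range(offset, n, block)) for offset in range(block)]

def connectivity(n, groups):
    """Boolean relation matrix: A[i][j] is True when i and j share a group."""
    a = [[False] * n for _ in range(n)]
    for group in groups:
        for i in group:
            for j in group:
                a[i][j] = True
    return a

def compose(second, first):
    """Which inputs j can reach output i after applying `first`, then `second`."""
    n = len(first)
    return [[any(second[i][k] and first[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

n, block = 16, 4  # toy sizes; n is assumed to be a multiple of block
G = connectivity(n, global_groups(n, block))  # global relation (long range)
L = connectivity(n, local_groups(n, block))   # local relation (short range)
reach = compose(L, G)                         # global module first, then local

entries_per_sparse_matrix = sum(sum(row) for row in G)
fully_connected = all(all(row) for row in reach)
```

In this toy setting each sparse relation matrix holds 64 of the 256 possible entries, yet their composition connects every input position to every output position. The count covers relation entries only and is not a FLOPs estimate comparable to the figures reported in the captions.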

Theorems & Definitions (1)

  • proof