FCN+: Global Receptive Convolution Makes FCN Great Again

Xiaoyu Ren, Zhongying Deng, Jin Ye, Junjun He, Dongxu Yang

TL;DR

This work introduces Global Receptive Convolution (GRC), a parameter-free mechanism that assigns channel-dependent sampling locations to enlarge the receptive field and inject global context into each pixel representation. By replacing the last-stage convolutions in FCN with GRC blocks, the authors form FCN+, a lightweight segmentation model that achieves competitive results on PASCAL VOC 2012, Cityscapes, and ADE20K without extra learnable parameters or decoders. Theoretical analysis shows that splitting the channels into two groups yields a global receptive field, and experiments demonstrate consistent improvements over FCN across multiple datasets, with additional gains in cross-dataset unsupervised domain adaptation (UDA) and modest gains in image classification. The approach offers a simple, efficient alternative to multi-scale or attention-based context integration for fast, context-aware semantic segmentation.

Abstract

The fully convolutional network (FCN) is a seminal work for semantic segmentation. However, due to its limited receptive field, FCN cannot effectively capture global context information, which is vital for semantic segmentation. As a result, it is outperformed by state-of-the-art methods that leverage different filter sizes for larger receptive fields, although such a strategy usually introduces more parameters and increases the computational cost. In this paper, we propose a novel global receptive convolution (GRC) to effectively increase the receptive field of FCN for context information extraction, which results in an improved FCN termed FCN+. The GRC provides a global receptive field for convolution without introducing any extra learnable parameters. The motivation of GRC is that different channels of a convolutional filter can have different grid sampling locations across the whole input feature map. Specifically, the GRC first divides the channels of the filter into two groups. The grid sampling locations of the first group are shifted to different spatial coordinates across the whole feature map, according to their channel indexes. This helps the convolutional filter capture global context information. The grid sampling locations of the second group remain unchanged to keep the original location information. By convolving with these two groups, the GRC integrates the global context into the original location information of each pixel for better dense prediction results. With the GRC built in, FCN+ achieves performance comparable to state-of-the-art methods for semantic segmentation, as verified on PASCAL VOC 2012, Cityscapes, and ADE20K. Our code will be released at https://github.com/Zhongying-Deng/FCN_Plus.
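The channel-wise shifting described in the abstract can be illustrated with a minimal NumPy sketch of a 1x1 GRC. This is not the authors' implementation: the function name `grc_1x1`, the choice to shift the second half of the channels (following the Figure 2 example), the row-major mapping from channel index to spatial offset, and the wrap-around at image borders are all assumptions made for illustration.

```python
import numpy as np

def grc_1x1(x, weight, g_h=2, g_w=2):
    """Hypothetical 1x1 GRC sketch.

    x:      input feature map of shape (C, H, W)
    weight: ordinary 1x1 convolution weights of shape (C_out, C)
    """
    c, h, w = x.shape
    half = c // 2
    # Channels per shifted sub-group (at least 1 so tiny inputs still work).
    per_group = max(1, c // (2 * g_h * g_w))
    shifted = x.copy()
    for ch in range(half, c):
        # Assumed mapping: sub-group index in row-major order over the grid.
        n = ((ch - half) // per_group) % (g_h * g_w)
        dh = (n // g_w) * (h // g_h)
        dw = (n % g_w) * (w // g_w)
        # Output pixel (i, j) reads input pixel (i + dh, j + dw);
        # np.roll wraps around at the borders (an assumption).
        shifted[ch] = np.roll(x[ch], shift=(-dh, -dw), axis=(0, 1))
    # A plain 1x1 convolution over the re-sampled channels: no new parameters.
    return np.einsum("oc,chw->ohw", weight, shifted)

# Demo on a 6x6 map with 8 channels and identity weights.
x = np.arange(8 * 6 * 6, dtype=float).reshape(8, 6, 6)
out = grc_1x1(x, np.eye(8))
```

With identity weights, `out[5, 1, 1]` equals `x[5, 1, 4]`, so the output at pixel (1, 1) already mixes in a value from a different spatial patch, while the first four channels are left untouched. The weight tensor has the same shape as for a standard 1x1 convolution, which is the sense in which the shift adds no learnable parameters.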

Paper Structure

This paper contains 17 sections, 14 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Comparison of global receptive convolution with other convolutions. (a) 3$\times$3 dilated convolution with dilation=2. (b) 3$\times$3 deformable convolution, which learns an offset for each spatial location to enlarge the receptive field. In both (a) and (b), all channels of a filter share exactly the same grid sampling locations, so the receptive field remains limited. (c) 1$\times$1 global receptive convolution (GRC). The GRC divides the 1$\times$1 kernel into two groups along the channel dimension, e.g., the light blue group and the group of other colors. The light blue group convolves the central pixel to keep the original location information, while the channels of the other colors are shifted to other spatial locations to capture global context information. Note that the 1$\times$1 GRC has a global receptive field, much larger than that of the 3$\times$3 dilated or deformable convolution.
  • Figure 2: Illustration of the 3$\times$3 global receptive convolution (GRC). This example shows how a GRC with a 3$\times$3 filter convolves a 6$\times$6$\times$8 input feature map $F$, with the central pixel $(h_0, w_0)=(1,1)$ (shown as the light red dot in the top row). GRC divides the channels into two groups. For the group $c \in \{0,1,2,3\}$, standard convolution is applied, as in the blue cube at the middle left. The group $c \in \{4,5,6,7\}$ is further split into $g_h \cdot g_w=4$ sub-groups, with $g_h=g_w=2$, so each sub-group has $\lfloor \frac{C}{2g_h\cdot g_w} \rfloor=1$ channel. Accordingly, the feature map is divided into $g_h \cdot g_w=4$ spatial patches, each comprising $\frac{H}{g_h}\times\frac{W}{g_w}=3\times 3$ pixels. According to Eq. (6) or (7), $\hat{\mathcal{N}}=\{(0,0), (0,3), (3,0), (3,3)\}$. Given a channel index such as $c=5$, we obtain $n=1$ via Eq. (8). Thus, for channel $c=5$, the offset is $\hat{\mathcal{N}}_1=(0,3)$. This offset, shown as the dashed arrow in the middle row, is added to $(h_0, w_0)=(1,1)$ to shift the central sampling location, giving the new central sampling location $(1,4)$. Centered on this location, the convolution is applied to the sub-group containing channel $c=5$, as shown in the green cube. Finally, the bottom row depicts the 2D region of the input data that influences the unit/neuron at location $(1,1)$ of the output feature map $F'$. The neuron at $(1,1)$ depends on the entire input, i.e., the red, green, purple, and yellow regions together cover the whole 6$\times$6 input.
  • Figure 3: The architecture of FCN+. As in FCN (Long et al., 2015), FCN+ combines coarse high-level information with fine low-level information. FCN+ adopts a ResNet backbone with five residual stages (Stage 1 is omitted for simplicity). For the implementation of the FCN head, see MMSegmentation (2020). The residual blocks in the last stage of FCN+ are replaced with GRC-based residual blocks.
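
The arithmetic in the Figure 2 caption can be checked with a short Python sketch. Eqs. (6) to (8) are not reproduced in this section, so the helper functions below (`grid_offsets`, `subgroup_index`, `receptive_field`) are hypothetical and use a row-major mapping that is only assumed to match those equations; it does reproduce the numbers quoted in the caption.

```python
from itertools import product

H, W, C = 6, 6, 8      # feature-map size from the Figure 2 example
G_H, G_W = 2, 2        # spatial groups along height and width
K = 3                  # kernel size

def grid_offsets(h, w, g_h, g_w):
    """Offsets of the spatial patches, listed in row-major order (assumed)."""
    return [(i * (h // g_h), j * (w // g_w))
            for i, j in product(range(g_h), range(g_w))]

def subgroup_index(c, channels, g_h, g_w):
    """Map a channel of the shifted half to its sub-group index n (assumed)."""
    half = channels // 2
    per_group = channels // (2 * g_h * g_w)
    if c < half:
        raise ValueError("channels in the first half are not shifted")
    return (c - half) // per_group

def receptive_field(center, offsets, h, w, k):
    """Input pixels touched by k x k windows centred at center + each offset."""
    r = k // 2
    covered = set()
    for dh, dw in offsets:
        ch, cw = center[0] + dh, center[1] + dw
        for i, j in product(range(-r, r + 1), repeat=2):
            y, x = ch + i, cw + j
            if 0 <= y < h and 0 <= x < w:   # ignore out-of-range taps
                covered.add((y, x))
    return covered

offsets = grid_offsets(H, W, G_H, G_W)      # [(0, 0), (0, 3), (3, 0), (3, 3)]
n = subgroup_index(5, C, G_H, G_W)          # 1
new_center = (1 + offsets[n][0], 1 + offsets[n][1])   # (1, 4)
covered = receptive_field((1, 1), offsets, H, W, K)
print(offsets, n, new_center, len(covered) == H * W)
```

For the central pixel (1, 1), the four shifted 3x3 windows tile the 6x6 input exactly, so `covered` contains all 36 positions. This matches the caption's statement that the output neuron at (1, 1) depends on the whole input.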