Spatial As Deep: Spatial CNN for Traffic Scene Understanding

Xingang Pan, Xiaohang Zhan, Jianping Shi, Ping Luo, Xiaogang Wang, Xiaoou Tang

TL;DR

Spatial CNN (SCNN) introduces a slice-based mechanism for diffusing spatial information within CNNs, capturing long-range spatial relationships in traffic scenes that conventional CNNs handle poorly for structured objects such as lanes and poles. By propagating messages in four directions via a shared kernel and applying this at a top CNN layer, SCNN performs efficient, residual spatial diffusion that improves both lane detection and semantic segmentation, outperforming ReNet, MRFNet, Dense CRF, and even ResNet-101 baselines. The authors validate SCNN on a newly released large-scale lane-detection dataset and on Cityscapes, reporting notable IoU gains for several classes, and the method won the TuSimple benchmark with an accuracy of 96.53%. The approach demonstrates that embedding directional, sequential spatial passes into CNNs can significantly enhance autonomous driving perception with minimal architectural disruption. Overall, SCNN offers a practical, end-to-end-friendly method to leverage spatial priors for both fine-grained and large-area understanding in traffic scenes.

Abstract

Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNNs have shown strong capability to extract semantics from raw pixels, their capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface, as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but few appearance cues, such as traffic lanes, poles, and walls. We apply SCNN to a newly released, very challenging traffic lane detection dataset and to the Cityscapes dataset. The results show that SCNN can learn the spatial relationships for structured output and significantly improves performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won 1st place in the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.
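
The slice-by-slice message passing described in the abstract can be sketched for a single direction (downward) as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the (channels, height, width) layout, the ReLU nonlinearity, and the zero padding are all assumptions made for the example. Each row of the feature map receives a convolved, rectified message from the already-updated row above it, added residually, with one kernel shared by every row.

```python
import numpy as np

def conv1d_same(row, kernel):
    """Cross-correlate one slice along its width with zero padding.

    row:    (C_in, W) slice of the feature map.
    kernel: (C_out, C_in, w) weights with odd width w (assumed layout).
    Returns an array of shape (C_out, W).
    """
    c_out, c_in, w = kernel.shape
    pad = w // 2
    width = row.shape[1]
    padded = np.pad(row, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, width))
    for n in range(w):
        # kernel[:, :, n] mixes channels; the slice is the row shifted by n - pad.
        out += kernel[:, :, n] @ padded[:, n:n + width]
    return out

def scnn_downward(features, kernel):
    """Hypothetical downward pass: row i gets a message from updated row i - 1.

    features: (C, H, W) feature map; kernel: (C, C, w), shared by all rows.
    The input is left untouched; a new array is returned.
    """
    out = features.astype(float).copy()
    for i in range(1, out.shape[1]):
        message = np.maximum(conv1d_same(out[:, i - 1, :], kernel), 0.0)  # ReLU
        out[:, i, :] += message  # residual update of the current slice
    return out

# Tiny example: one channel, 3x3 map of ones, 1-wide kernel equal to 1.
x = np.ones((1, 3, 3))
k = np.ones((1, 1, 1))
y = scnn_downward(x, k)
print(y[0])  # rows accumulate top to bottom: 1, 2, 3
```

Because each row is updated from the already-updated row above, information from the top row can reach the bottom row in a single pass. Under the same assumptions, the other three directions could be obtained by flipping or transposing the spatial axes before and after calling the same routine.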

Paper Structure

This paper contains 13 sections, 1 equation, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison between CNN and SCNN in (a) lane detection and (b) semantic segmentation. For each example, from left to right: input image, output of CNN, output of SCNN. SCNN better captures the long continuous shape prior of lane markings and poles and fixes the disconnected parts in the CNN output.
  • Figure 2: (a) Dataset examples for different scenarios. (b) Proportion of each scenario.
  • Figure 3: (a) MRF/CRF based method. (b) Our implementation of Spatial CNN. MRF/CRF are theoretically applied to unary potentials whose channel number equals the number of classes to be classified, while SCNN can be applied to the top hidden layers, which carry richer information.
  • Figure 4: Message passing directions in (a) dense MRF/CRF and (b) Spatial CNN (rightward). For (a), only message passing to the inner 4 pixels is shown for clarity.
  • Figure 5: (a) Training model, (b) Lane prediction process. 'Conv', 'HConv', and 'FC' denote convolution layer, atrous convolution layer (Chen et al. 2016), and fully connected layer respectively. 'c', 'w', and 'h' denote the number of output channels, the kernel width, and the 'rate' for atrous convolution respectively.
  • ...and 3 more figures