Spatial As Deep: Spatial CNN for Traffic Scene Understanding
Xingang Pan, Xiaohang Zhan, Jianping Shi, Ping Luo, Xiaogang Wang, Xiaoou Tang
TL;DR
Spatial CNN (SCNN) introduces a slice-based mechanism for diffusing spatial information within CNN feature maps, capturing long-range spatial relationships in traffic scenes that conventional CNNs handle poorly for structured objects like lanes and poles. Messages are propagated sequentially in four directions, each with a shared kernel, and added to the features residually at a top CNN layer. This efficient diffusion improves both lane detection and semantic segmentation, outperforming ReNet, MRFNet, Dense CRF, and even ResNet-101 baselines. The authors validate SCNN on a newly released large-scale lane detection dataset and on Cityscapes, reporting notable IoU gains for several classes. SCNN also won first place in the TuSimple Benchmark Lane Detection Challenge with an accuracy of 96.53%. The approach demonstrates that embedding directional, sequential spatial passes into CNNs can significantly enhance autonomous driving perception with minimal architectural disruption. Overall, SCNN offers a practical, end-to-end-friendly method to leverage spatial priors for both fine-grained and large-area understanding in traffic scenes.
Abstract
Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer by layer. Although CNNs have shown strong capability to extract semantics from raw pixels, their capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface, as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but few appearance clues, such as traffic lanes, poles, and walls. We apply SCNN to a newly released, very challenging traffic lane detection dataset and to the Cityscapes dataset. The results show that SCNN can learn the spatial relationships for structured output and significantly improves performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6%, respectively. Moreover, our SCNN won 1st place in the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.
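To make the slice-by-slice idea concrete, the following is a minimal NumPy sketch of one downward pass over a feature map, plus a wrapper that chains the four directions. It is an illustration under stated assumptions, not the authors' implementation: the function names `scnn_pass_down` and `scnn_four_directions`, the ReLU nonlinearity, the zero padding, the order of the directions, and the kernel layout `(C, C, w)` are all choices made here for the example.

```python
import numpy as np

def scnn_pass_down(x, kernel):
    """Pass messages from the top row to the bottom row of a feature map.

    x:      feature map of shape (C, H, W)
    kernel: weights of shape (C, C, w) with odd w, shared by every row
            (layout assumed here as: output channel, input channel, width)
    """
    c, h, w = x.shape
    k = kernel.shape[2]
    pad = k // 2
    out = x.astype(float)  # astype returns a copy, so x is not modified
    for row in range(1, h):
        # The previous row has already been updated, so information
        # accumulates as the pass moves down the image.
        prev = np.pad(out[:, row - 1, :], ((0, 0), (pad, pad)))
        msg = np.zeros((c, w))
        for n in range(k):
            # 1-D convolution along the width that also mixes channels.
            msg += kernel[:, :, n] @ prev[:, n:n + w]
        # Residual update: add the (assumed ReLU) message to the current row.
        out[:, row, :] += np.maximum(msg, 0.0)
    return out

def scnn_four_directions(x, kernels):
    """Chain downward, upward, rightward, and leftward passes.

    kernels: dict with keys "down", "up", "right", "left" (hypothetical names).
    """
    x = scnn_pass_down(x, kernels["down"])
    x = scnn_pass_down(x[:, ::-1, :], kernels["up"])[:, ::-1, :]
    xt = x.transpose(0, 2, 1)  # treat columns as rows for horizontal passes
    xt = scnn_pass_down(xt, kernels["right"])
    xt = scnn_pass_down(xt[:, ::-1, :], kernels["left"])[:, ::-1, :]
    return xt.transpose(0, 2, 1)
```

Because each row receives a message from the already-updated row before it, a single activation in the first row can influence every later row in one pass, which is the long-range propagation the abstract describes. A stacked layer-by-layer convolution with the same kernel width would need many layers to reach as far.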
