Efficient stereo matching on embedded GPUs with zero-means cross correlation
Qiong Chang, Aolong Zha, Weimin Wang, Xin Liu, Masaki Onishi, Lei Lei, Meng Joo Er, Tsutomu Maruyama
TL;DR
The paper tackles the challenge of delivering high-accuracy stereo matching on resource-constrained mobile GPUs. It bases cost computation on zero-means normalized cross correlation (ZNCC) and introduces a novel Z$^2$-ZNCC method that scans images in a zigzag order to reuse computations and reduce memory transfers, together with a two-step cost aggregation based on domain transformation (DT) and a data-width reduction strategy for real-time performance on a Jetson TX2. Key contributions include two parallel summation methods, an efficient zigzag-based summation pipeline, and FastDT integration that enables real-time operation at 32 fps for 1,280×384 imagery with disparities up to 128, while achieving higher KITTI 2015 accuracy than census-based baselines. The approach demonstrates a favorable speed-accuracy trade-off on embedded GPUs and remains applicable to platforms beyond GPUs, including FPGAs. Overall, the work delivers a practical, high-accuracy, real-time stereo solution for mobile and embedded systems.
Abstract
Mobile stereo-matching systems have become an important part of many applications, such as automated-driving vehicles and autonomous robots. Accurate stereo-matching methods usually have high computational complexity, whereas mobile platforms have only limited hardware resources in order to keep their power consumption low. This makes it difficult to maintain both an acceptable processing speed and accuracy on mobile platforms. To resolve this trade-off, we herein propose a novel acceleration approach for the well-known zero-means normalized cross correlation (ZNCC) matching cost calculation algorithm on a Jetson TX2 embedded GPU. In our method for accelerating ZNCC, target images are scanned in a zigzag fashion to efficiently reuse one pixel's computation for its neighboring pixels; this reduces the amount of data transmission and increases the utilization of on-chip registers, thus increasing the processing speed. As a result, our method is 2x faster than the traditional image scanning method and 26% faster than the latest NCC method. By combining this technique with the domain transformation (DT) algorithm, our system achieves a real-time processing speed of 32 fps on a Jetson TX2 GPU for 1,280x384 pixel images with a maximum disparity of 128. Additionally, the evaluation results on the KITTI 2015 benchmark show that our combined system is 7.26% more accurate than the same algorithm combined with census, while maintaining almost the same processing speed.
Source Code: https://github.com/changqiong/Z2ZNCC.git
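For readers unfamiliar with the matching cost named above, the following is a minimal reference sketch of plain ZNCC in pure Python. It is not the Z$^2$-ZNCC method of the paper: it recomputes every window from scratch, which is exactly the redundancy that zigzag scanning is meant to avoid. The function names `zncc` and `best_disparity`, the window radius `r`, and the synthetic test images are assumptions made for illustration only.

```python
import math

def zncc(left, right, x, y, d, r):
    """ZNCC between the (2r+1)x(2r+1) window centred at (x, y) in `left`
    and the window centred at (x - d, y) in `right`."""
    rows = range(y - r, y + r + 1)
    cols = range(x - r, x + r + 1)
    a = [left[j][i] for j in rows for i in cols]
    b = [right[j][i - d] for j in rows for i in cols]
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Subtracting the window means removes brightness offsets;
    # dividing by the standard deviations removes gain differences.
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    denom = math.sqrt(var_a * var_b)
    # Textureless windows have zero variance; treat them as uncorrelated.
    return cov / denom if denom > 0 else 0.0

def best_disparity(left, right, x, y, max_d, r):
    """Winner-takes-all: the disparity with the highest ZNCC score."""
    # Limit d so that the right-image window stays inside the image.
    candidates = range(0, min(max_d, x - r) + 1)
    return max(candidates, key=lambda d: zncc(left, right, x, y, d, r))

# Synthetic pair (assumed for illustration): the left image is the right
# image shifted by 3 pixels, with a different gain and brightness offset.
H, W = 5, 16
right = [[(7 * x * x + 13 * y + x * y) % 31 for x in range(W)] for y in range(H)]
left = [[2 * right[y][x - 3] + 10 if x >= 3 else 0 for x in range(W)]
        for y in range(H)]

score = zncc(left, right, 8, 2, 3, 1)
disparity = best_disparity(left, right, 8, 2, 6, 1)
```

Because the score is normalized by both window means and variances, the gain and offset applied to the synthetic left image do not affect the match at the true shift. This robustness to illumination differences is the usual reason for choosing ZNCC, and its cost in extra sums per window is what motivates reusing partial sums between neighboring pixels.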
