FPGA-Accelerated Correspondence-free Point Cloud Registration with PointNet Features
Keisuke Sugiura, Hiroki Matsutani
TL;DR
This work targets fast, accurate 3D point cloud registration on resource-constrained edge devices using FPGA-accelerated, correspondence-free methods. It combines a streamlined PointNet feature extractor with two dedicated IP cores, PointLKCore for PointNetLK and ReAgentCore for ReAgent, and uses LLT-based quantization to keep all network parameters on-chip. On the Xilinx ZCU104 board, the cores run 44.08-45.75x faster than an ARM Cortex-A53 CPU and 1.98-11.13x faster than an Intel Xeon CPU and Nvidia Jetson boards, consume less than 1W, and are 163.11-213.58x more energy-efficient than an Nvidia GeForce GPU, while maintaining competitive accuracy under noise and large initial misalignments. They find reasonable solutions in less than 15ms, enabling real-time registration on low-power hardware. The approach also demonstrates strong generalization to unseen categories and real-world scans, and design-space exploration guides the choice of FPGA implementation.
Abstract
Point cloud registration serves as a basis for vision and robotic applications, including 3D reconstruction and mapping. Despite significant improvements in the quality of results, recent deep learning approaches are computationally expensive and power-hungry, making them difficult to deploy on resource-constrained edge devices. To tackle this problem, we propose a fast, accurate, and robust registration method for low-cost embedded FPGAs. Based on a parallel and pipelined PointNet feature extractor, we develop custom accelerator cores, namely PointLKCore and ReAgentCore, for two different learning-based methods. Both are correspondence-free and computationally efficient, as they avoid the costly feature matching step involving nearest-neighbor search. The proposed cores are implemented on the Xilinx ZCU104 board and evaluated using both synthetic and real-world datasets, showing substantial improvements in the trade-off between runtime and registration quality. They run 44.08-45.75x faster than an ARM Cortex-A53 CPU and offer 1.98-11.13x speedups over an Intel Xeon CPU and Nvidia Jetson boards, while consuming less than 1W and achieving 163.11-213.58x higher energy efficiency than an Nvidia GeForce GPU. The proposed cores are more robust to noise and large initial misalignments than classical methods and find reasonable solutions in less than 15ms, demonstrating real-time performance.
