PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators

Keondo Park, You Rim Choi, Inhoe Lee, Hyung-Sin Kim

TL;DR

PointSplit tackles on-device 3D object detection on edge devices with heterogeneous accelerators by jointly optimizing system and algorithm design. It introduces two parallel set abstraction pipelines guided by 2D semantic information, a semantics-aware biased sampling mechanism, and role-based group-wise quantization to enable efficient INT8 execution across GPU and NPU. Empirical results on SUN RGB-D and ScanNet V2 show a speedup of up to 24.7× with similar accuracy compared to a full-precision, 2D-3D fusion-based detector on a GPU-only device, supporting the viability of multi-type accelerators for real-time 3D perception. The work also provides an open TensorFlow/TensorFlow Lite implementation and demonstrates the broader potential of heterogeneous edge platforms for complex vision tasks.

Abstract

Running deep learning models on resource-constrained edge devices has drawn significant attention due to its fast response, privacy preservation, and robust operation regardless of Internet connectivity. While these devices already cope with various intelligent tasks, the latest edge devices that are equipped with multiple types of low-power accelerators (i.e., both mobile GPU and NPU) can bring another opportunity; a task that used to be too heavy for an edge device in the single-accelerator world might become viable in the upcoming heterogeneous-accelerator world. To realize the potential in the context of 3D object detection, we identify several technical challenges and propose PointSplit, a novel 3D object detection framework for multi-accelerator edge devices that addresses the problems. Specifically, our PointSplit design includes (1) 2D semantics-aware biased point sampling, (2) parallelized 3D feature extraction, and (3) role-based group-wise quantization. We implement PointSplit on TensorFlow Lite and evaluate it on a customized hardware platform comprising both mobile GPU and EdgeTPU. Experimental results on representative RGB-D datasets, SUN RGB-D and ScanNet V2, demonstrate that PointSplit on a multi-accelerator device is 24.7 times faster with similar accuracy compared to the full-precision, 2D-3D fusion-based 3D detector on a GPU-only device.

Paper Structure

This paper contains 25 sections, 1 equation, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Target scenario: On-device 3D indoor scene understanding via RGB-D camera. On-device detection offers advantages in privacy, latency, and communication burden. 2D-3D fusion can improve detection accuracy, while utilizing both GPU and NPU can accelerate on-device inference.
  • Figure 2: Illustration of naïve workload distribution to run the sequential pipeline of PointPainting on a GPU-NPU combined environment. Among the three shapes in the input scene, only the triangle is assumed to be a valid object (foreground points). One of the two processors is always idle, waiting for the other to finish its job.
  • Figure 3: Illustration of PointSplit's parallelized set abstraction (SA) pipeline. Each lightweight SA process in PointSplit generates only half the number of balls compared to the conventional SA layers in PointNet++. While the GPU handles point manipulation for one SA pipeline, the NPU runs PointNet for the other SA pipeline in parallel, which reduces idle time on each processor.
  • Figure 4: Illustration of PointSplit's semantics-aware biased point sampling. Using the same point cloud scene, our biased sampling can create multiple significantly different views by controlling the weight coefficient $w_0$.
  • Figure 5: Illustration of PointNet++ structure optimized for PointSplit. (1) An input point cloud is divided into two heterogeneous SA pipelines, one with regular FPS and the other with biased FPS. (2) The two SA pipelines share a single PointNet for a data-augmentation effect. (3) The two SA pipelines are merged before the fourth SA layer. (4) After the SA layers, two FP layers are processed back to back and the last single PointNet produces the final output.
  • ...and 5 more figures
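
The captions for Figures 4 and 5 mention regular FPS and a biased FPS controlled by a weight coefficient $w_0$, but this summary does not give the exact formulation. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes that biased FPS multiplies each foreground point's squared distance to the already-selected set by $w_0$, so that $w_0 = 1$ reduces to regular farthest point sampling and $w_0 > 1$ favors foreground points. The function name `biased_fps`, its parameters, and the foreground mask (standing in for 2D semantic labels projected onto the points) are all hypothetical.

```python
from typing import List, Sequence


def biased_fps(
    points: Sequence[Sequence[float]],
    is_foreground: Sequence[bool],
    num_samples: int,
    w0: float = 1.0,
    start: int = 0,
) -> List[int]:
    """Farthest point sampling with an assumed foreground bias.

    Assumption (not taken from the paper): the squared distance of a
    foreground point to the selected set is scaled by w0 before the
    arg-max. w0 == 1.0 gives regular FPS.
    """
    n = len(points)
    if not 0 < num_samples <= n:
        raise ValueError("num_samples must be in the range 1..len(points)")

    # Per-point bias: w0 for foreground points, 1.0 for background points.
    weights = [w0 if fg else 1.0 for fg in is_foreground]
    # Squared distance from each point to its nearest selected point so far.
    min_sq = [float("inf")] * n

    selected: List[int] = []
    idx = start
    for _ in range(num_samples):
        selected.append(idx)
        pivot = points[idx]
        # Update nearest-selected distances with the newly selected point.
        for j in range(n):
            d = sum((a - b) ** 2 for a, b in zip(points[j], pivot))
            if d < min_sq[j]:
                min_sq[j] = d
        # Next pick: the point with the largest biased distance.
        idx = max(range(n), key=lambda j: weights[j] * min_sq[j])
    return selected
```

Under this assumption, running the function twice on the same scene, once with `w0=1.0` and once with a larger `w0`, yields the two different views of the point cloud that could feed the two SA pipelines described in Figure 5.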