
VoxDepth: Rectification of Depth Images on Edge Devices

Yashashwee Chakrabarty, Smruti Ranjan Sarangi

TL;DR

Depth perception in lightweight robots often suffers from flickering pixels and algorithmic holes, which degrade SLAM and collision avoidance. VoxDepth uses fast 3D point cloud fusion to build a stable scene template and projects it into a 2D depth template. It then rectifies incoming frames in real time with a pipelined, non-ML correction workflow that relies on ORB-based registration. It reports a 31% PSNR improvement on real-world data and a 25% improvement in occlusion-focused masked RMSE. It runs at 26.7–27 FPS on a Jetson Nano and consumes less power than competing approaches. This enables robust depth rectification on edge devices, supporting reliable SLAM, object detection, and swarming in practical robotic scenarios.
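The two quality metrics quoted above can be made concrete with a short sketch. It assumes that depth maps are equally sized 2D lists of floats. For PSNR it assumes a peak value `max_depth`, since the normalization used in the paper's evaluation is not stated here. For masked RMSE it assumes a boolean mask that marks the occlusion-affected pixels. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Depth = List[List[float]]


def mse(pred: Depth, ref: Depth) -> float:
    """Mean squared error over all pixels of two equally sized depth maps."""
    diffs = [(p - r) ** 2 for prow, rrow in zip(pred, ref) for p, r in zip(prow, rrow)]
    return sum(diffs) / len(diffs)


def psnr(pred: Depth, ref: Depth, max_depth: float) -> float:
    """Peak signal-to-noise ratio in dB; max_depth is an assumed peak value."""
    err = mse(pred, ref)
    if err == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_depth ** 2 / err)


def masked_rmse(pred: Depth, ref: Depth, mask: Sequence[Sequence[bool]]) -> float:
    """RMSE restricted to pixels where mask is True (e.g. occluded regions)."""
    diffs = [
        (p - r) ** 2
        for prow, rrow, mrow in zip(pred, ref, mask)
        for p, r, m in zip(prow, rrow, mrow)
        if m
    ]
    if not diffs:
        raise ValueError("mask selects no pixels")
    return math.sqrt(sum(diffs) / len(diffs))
```

Restricting the error to a mask makes the metric sensitive to the occluded regions that global PSNR tends to average away. For example, a single 1 m error in a 2x2 map gives an unmasked RMSE of 0.5 m but a masked RMSE of 1.0 m when the mask selects only that pixel.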

Abstract

Autonomous mobile robots like self-flying drones and industrial robots depend heavily on depth images for tasks such as 3D reconstruction and visual SLAM. However, inaccuracies in these depth images can greatly reduce the effectiveness of these applications. Depth images produced by commercially available cameras frequently exhibit noise, which manifests as flickering pixels and erroneous patches. ML-based methods for rectifying these images are unsuitable for edge devices, which have very limited computational resources. Non-ML methods are much faster but have limited accuracy, especially when correcting errors caused by occlusion and camera movement. We propose a scheme called VoxDepth that is fast, accurate, and runs well on edge devices. It relies on a set of novel techniques: constructing and fusing 3D point clouds, and using the fused cloud to create a template that can fix erroneous depth images. VoxDepth shows superior results on both synthetic and real-world datasets. We demonstrate a 31% improvement in quality over state-of-the-art methods on real-world depth datasets, while maintaining a competitive frame rate of 27 FPS (frames per second).
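The template idea in the abstract can be sketched in three steps: back-project depth frames into a 3D point cloud, fuse the clouds, and project the result back to the current view to fill holes. This is not the VoxDepth implementation. The sketch assumes an ideal pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`. It also assumes that camera-to-world poses are already known; VoxDepth itself aligns frames with ORB-based registration, which is omitted here. The voxel-averaging fusion rule, the z-buffer projection, and the "fill only invalid pixels" rectification rule are likewise assumptions chosen for illustration. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Depth = List[List[float]]               # meters; 0.0 marks an invalid pixel (hole)
Point = Tuple[float, float, float]
Pose = Tuple[List[List[float]], Point]  # (3x3 rotation R, translation t), camera-to-world


def backproject(depth: Depth, fx: float, fy: float, cx: float, cy: float,
                pose: Pose) -> List[Point]:
    """Lift every valid pixel to a world-frame 3D point with a pinhole model."""
    R, t = pose
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # holes carry no geometry
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            points.append(tuple(
                sum(R[i][k] * cam[k] for k in range(3)) + t[i] for i in range(3)))
    return points


def voxel_fuse(clouds: Sequence[List[Point]], voxel: float) -> List[Point]:
    """Merge clouds, averaging all points that fall into the same voxel (assumed rule)."""
    cells: Dict[Tuple[int, ...], List[float]] = {}
    for cloud in clouds:
        for p in cloud:
            key = tuple(math.floor(c / voxel) for c in p)
            acc = cells.setdefault(key, [0.0, 0.0, 0.0, 0.0])
            for i in range(3):
                acc[i] += p[i]
            acc[3] += 1.0
    return [(a[0] / a[3], a[1] / a[3], a[2] / a[3]) for a in cells.values()]


def project(points: List[Point], fx: float, fy: float, cx: float, cy: float,
            pose: Pose, h: int, w: int) -> Depth:
    """Render world points into an h x w depth template for `pose` using a z-buffer."""
    R, t = pose
    out = [[0.0] * w for _ in range(h)]
    for p in points:
        d = [p[i] - t[i] for i in range(3)]
        # world -> camera applies R transposed, the inverse of a rotation
        xc, yc, zc = (sum(R[k][i] * d[k] for k in range(3)) for i in range(3))
        if zc <= 0.0:
            continue  # behind the camera
        u, v = round(fx * xc / zc + cx), round(fy * yc / zc + cy)
        if 0 <= u < w and 0 <= v < h and (out[v][u] == 0.0 or zc < out[v][u]):
            out[v][u] = zc  # keep the nearest surface
    return out


def rectify(frame: Depth, template: Depth) -> Depth:
    """Keep valid pixels of the incoming frame; fill its holes from the template."""
    return [[f if f > 0.0 else tp for f, tp in zip(frow, trow)]
            for frow, trow in zip(frame, template)]
```

Because fusion pools several views, a hole that appears in only one frame is covered by points from the others. A region that is missing from every fused view has no points, so it remains 0.0 in the template.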

Paper Structure (52 sections, 16 equations, 16 figures, 10 tables, 1 algorithm)

Figures (16)

  • Figure 1: Visual representation of algorithmic holes (green ellipse) and flickering noise (yellow ellipse) across two frames. The flickering noise appears and disappears across frames, but algorithmic noise persists across frames as long as the object remains present.
  • Figure 2: Visual representation of the stereoscopic depth estimation method
  • Figure 3: Visual representation of algorithmic noise (highlighted by a green box) in an image from the Mid-air dataset [fonder2019mid].
  • Figure 4: Fused point cloud produced by the fusion step. Fusion fills holes in the point cloud.
  • Figure 5: Depth images from different datasets used in this work
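The figures above contrast two kinds of error, and a small sketch can tie them to stereo depth estimation. For a rectified stereo pair, depth follows from disparity as Z = f * B / d. Here f is the focal length in pixels, B is the baseline in meters, and d is the disparity in pixels. Pixels where matching fails (d = 0) have no depth and show up as holes. Across a short sequence of such frames of a static scene, a pixel that is invalid in every frame behaves like an algorithmic hole. A pixel that is invalid only in some frames behaves like flickering noise. This labeling rule and the function names are assumptions for illustration, not the classification used in the paper.

```python
from typing import List

Depth = List[List[float]]  # 0.0 marks a pixel with no valid depth


def disparity_to_depth(disp: List[List[float]], focal_px: float,
                       baseline_m: float) -> Depth:
    """Z = f * B / d per pixel; a non-positive disparity (failed match) becomes a hole."""
    return [[focal_px * baseline_m / d if d > 0 else 0.0 for d in row] for row in disp]


def classify_invalid(frames: List[Depth]) -> List[List[str]]:
    """Label each pixel of a static sequence as 'ok', 'flicker', or 'persistent'."""
    h, w = len(frames[0]), len(frames[0][0])
    labels = [["ok"] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            invalid = sum(1 for f in frames if f[v][u] <= 0.0)
            if invalid == len(frames):
                labels[v][u] = "persistent"  # hole in every frame, like an algorithmic hole
            elif invalid > 0:
                labels[v][u] = "flicker"  # appears and disappears across frames
    return labels
```

Because Z is inversely proportional to d, a one-pixel disparity error shifts distant points much more than near ones. This is one reason stereo depth maps look noisier at long range.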