DPVO-QAT++: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry

Cheng Liao

TL;DR

DPVO-QAT++ addresses the deployment gap of deep visual odometry by introducing a heterogeneous precision framework that applies geometry-preserving scale-only quantization to the front-end while keeping the back-end in full precision. It combines offline scale-learning via teacher-student distillation with online GPU-native fusion of fake-quantization operations, resulting in significant speedups and memory reductions without sacrificing trajectory accuracy. The approach is validated on EuRoC and TartanAir, showing substantial improvements in FPS and latency and reductions in peak memory, while preserving ATE comparable to the baseline. This hardware-software co-design offers a practical engineering paradigm for running high-accuracy deep VO on resource-constrained embedded platforms and can be generalized to other deep front-end/classical back-end perception pipelines.

Abstract

Deep learning-based Visual SLAM (vSLAM) systems exhibit exceptional geometric reasoning capabilities, yet their prohibitive computational overhead severely restricts deployment on resource-constrained autonomous platforms. This paper presents DPVO-QAT++, a hierarchical quantization optimization framework for Deep Patch Visual Odometry (DPVO). It combines three components: learnable scale parameterization; a heterogeneous precision design for the Visual Odometry (VO) pipeline, in which the front-end uses floating-point fake quantization in FP16/FP32 and the back-end stays in full precision; and GPU-native kernel fusion of the fake-quantization operations through custom CUDA kernels. Together, these significantly reduce memory footprint and increase processing speed while preserving the trajectory accuracy of the original model. On the TartanAir dataset, the framework achieves an average FPS increase of 52.1%, a 29.1% reduction in median latency, and a 64.9% reduction in peak GPU memory reservation, while maintaining trajectory accuracy (ATE) comparable to the original DPVO model across 32 validation sequences. On the EuRoC dataset, it achieves an average FPS increase of 30.1%, a 23.1% reduction in median latency, and a 37.7% reduction in peak GPU memory reservation, again with comparable ATE across 11 validation sequences. These results indicate that DPVO-QAT++ narrows the gap between high-precision deep VO and the efficiency requirements of practical deployment, offering a viable engineering paradigm for real-world embedded platforms.

Keywords: Visual Odometry, Heterogeneous Precision Architecture, Quantization-Aware Training, CUDA Kernel Fusion, Scale-Only Training, Deep Patch Visual Odometry, GPU-Native Kernel Fusion.

Paper Structure

This paper contains 37 sections, 5 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Layered deployment and optimization framework for DPVO. Panel A shows the main inference flow (A1 $\rightarrow$ A6): from input preprocessing to the quantized front-end Patchifier (floating-point simulation of scale-learning QAT), followed by feature extraction and correlation, then the front-end update, then the FP32 geometric back-end (BA/reprojection/map), and finally evaluation and logging. Under A2, two implementations of the internal fake-quantized convolution (Conv2d) are shown: left, CUDA-fused (compute scales $\rightarrow$ fake-quantize activations $\rightarrow$ fake-quantize weights $\rightarrow$ convolution); right, per-operator Python (compute scales $\rightarrow$ STE fake-quantization $\rightarrow$ F.conv2d). Panel B depicts the offline QAT training pipeline, where only the scale parameters for weights and activations (log_w_scale/log_a_scale) are learned and then injected into the Patchifier at evaluation time. Color coding: green denotes the front-end QAT/fake-quant; ochre denotes the native DPVO CUDA back-end; blue denotes I/O and evaluation; gray denotes loss/optimization.
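
The fake-quantized convolution described under A2 (compute scales, fake-quantize activations, fake-quantize weights, convolve) can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation. The function names, the signed 8-bit range, the symmetric rounding scheme, and the 1-D convolution standing in for F.conv2d are all assumptions made for illustration; only the names log_a_scale and log_w_scale come from the caption. The straight-through estimator (STE) used during training concerns gradients and is not shown here.

```python
import math


def fake_quantize(values, log_scale, num_bits=8):
    """Quantize then dequantize: snap each float onto a grid of step exp(log_scale).

    The result stays in floating point, which is what "fake" quantization means.
    The signed num_bits range is an assumption, not taken from the paper.
    """
    scale = math.exp(log_scale)      # log-space parameter keeps the scale positive
    qmax = 2 ** (num_bits - 1) - 1   # 127 for 8 bits
    qmin = -qmax - 1                 # -128 for 8 bits
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # round, then clamp to range
        out.append(q * scale)                       # map back to float
    return out


def conv1d_valid(signal, kernel):
    """Plain 'valid' cross-correlation; a 1-D stand-in for F.conv2d."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ]


def fake_quant_conv1d(signal, kernel, log_a_scale, log_w_scale, num_bits=8):
    """Hypothetical helper: fake-quantize activations and weights, then convolve."""
    a_q = fake_quantize(signal, log_a_scale, num_bits)
    w_q = fake_quantize(kernel, log_w_scale, num_bits)
    return conv1d_valid(a_q, w_q)
```

In the per-operator path each of these steps would run as a separate operation, whereas the fused path in the figure performs them inside one CUDA kernel. The arithmetic is the same in both cases; only the number of kernel launches and intermediate buffers differs.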