Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses

Yongfan Liu, Hyoukjun Kwon

TL;DR

This work tackles the latency bottlenecks of stereo depth estimation on AR glasses in two ways: it eliminates online rectification by predicting a homography matrix inside the network, and it replaces the traditional cost volume with a hardware-friendly multi-head operator that approximates cosine similarity using layer normalization and a dot product. The authors introduce MultiHeadDepth, a lighter and faster depth estimator, and HomoDepth, which processes unrectified inputs via a shared encoder and a homography head with 2D rectification positional encoding. In experiments on SceneFlow, ADT, and DTU, the methods improve accuracy by up to 30.3% (relative) and reduce end-to-end latency by up to about 44%, while remaining robust to misaligned stereo inputs. These hardware-aware strategies target real-time AR depth pipelines and can complement existing stereo depth models through multi-task training.

Abstract

Stereo depth estimation is a fundamental component of augmented reality (AR), which requires low latency for real-time processing. However, preprocessing such as rectification and non-ML computations such as the cost volume incur significant latency, exceeding that of the ML model itself, which hinders the real-time processing required by AR. Therefore, we develop alternative approaches to rectification and the cost volume that take into account the ML accelerators (GPUs and NPUs) in recent hardware. For preprocessing, we eliminate rectification by introducing a homography matrix prediction network with a rectification positional encoding (RPE), which delivers both low latency and robustness to unrectified images. For the cost volume, we replace it with a group-pointwise convolution-based operator and an approximation of cosine similarity based on layer normalization and a dot product. Based on these approaches, we develop the MultiHeadDepth (replacing the cost volume) and HomoDepth (MultiHeadDepth plus removal of preprocessing) models. MultiHeadDepth provides 11.8-30.3% improvements in accuracy and a 22.9-25.2% reduction in latency compared to a state-of-the-art depth estimation model for AR glasses from industry. HomoDepth, which can directly process unrectified images, reduces the end-to-end latency by 44.5%. We also introduce a multi-task learning method to handle misaligned stereo inputs on HomoDepth, which reduces the AbsRel error by 10.0-24.3%. The overall results demonstrate the efficacy of our approaches, which not only reduce inference latency but also improve model performance. Our code is available at https://github.com/UCI-ISA-Lab/MultiHeadDepth-HomoDepth
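
The abstract mentions approximating cosine similarity with layer normalization and a dot product. The NumPy sketch below illustrates the general idea only and is not the authors' operator: it assumes a LayerNorm without learned scale or shift applied over the channel axis, and the function names (layer_norm, layernorm_dot_similarity) are hypothetical. Because each layer-normalized vector has squared norm close to the channel count C, the dot product divided by C equals the cosine similarity of the mean-centered vectors, which is close to the plain cosine similarity when the feature means are small.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last (channel) axis, no learned scale/shift (assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def cosine_similarity(a, b, eps=1e-8):
    """Reference cosine similarity over the channel axis."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / (den + eps)

def layernorm_dot_similarity(a, b):
    """Dot product of layer-normalized features, scaled by the channel count.

    This avoids the per-vector norm and division of cosine similarity;
    it is exact for mean-centered vectors and approximate otherwise.
    """
    channels = a.shape[-1]
    return (layer_norm(a) * layer_norm(b)).sum(axis=-1) / channels

# Synthetic (H, W, C) feature maps; the right features are a noisy copy of the left.
rng = np.random.default_rng(0)
left = rng.standard_normal((4, 6, 64))
right = left + 0.5 * rng.standard_normal((4, 6, 64))

exact = cosine_similarity(left, right)
approx = layernorm_dot_similarity(left, right)
max_gap = float(np.abs(exact - approx).max())
print("largest gap between exact and approximate similarity:", max_gap)
```

The appeal of this form on ML accelerators is that both steps (normalization and dot product) map onto operators that GPUs and NPUs already execute efficiently.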

Paper Structure

This paper contains 32 sections, 13 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Latency breakdown of the SOTA model Argos [wang2023practical] and of our models, MultiHeadDepth and HomoDepth, on an Intel i7-12700H laptop CPU. "RPE" refers to the 2D rectification positional encoding process. "Others" refers to all parts of the neural network other than the cost volume blocks, such as Conv, Norm, FC, and ReLU6.
  • Figure 2: Comparison of similarity estimation methods. Each map shows the similarity between left and right features after applying a roll operation (offset: 10) to the input stereo images. The maps are averaged over the entire SceneFlow dataset and rescaled to the [0, 1] range. The strips on the left side of the maps are caused by the roll operation and are part of the maps.
  • Figure 3: The comparison between the original cost volume and our approach, MultiheadCostVolume
  • Figure 4: The structure of MultiHeadDepth. The dashed lines indicate that the input activations from the left image are passed to the decoders. The input example is from the ADT dataset.
  • Figure 5: The structure of HomoDepth. The input example is from the DTU dataset.
  • ...and 4 more figures
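
The roll-based similarity map described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is a hedged toy example rather than the paper's procedure: the caption applies the roll to the input stereo images, whereas this sketch rolls synthetic feature arrays directly, and the helper name roll_similarity_map and all shapes are hypothetical. It shows why a roll with a fixed offset leaves a strip of unreliable values on the left side of the map: the columns that wrap around come from the opposite image border.

```python
import numpy as np

def roll_similarity_map(left, right, offset, eps=1e-8):
    """Per-pixel cosine similarity between left features and rolled right features.

    Arrays have shape (C, H, W). np.roll wraps around, so the first `offset`
    columns of the rolled array come from the right border of `right`.
    """
    shifted = np.roll(right, shift=offset, axis=-1)
    num = (left * shifted).sum(axis=0)
    den = np.linalg.norm(left, axis=0) * np.linalg.norm(shifted, axis=0) + eps
    return num / den

rng = np.random.default_rng(0)
C, H, W, offset = 8, 4, 32, 10

left = rng.standard_normal((C, H, W))
# Synthetic right view: the left features shifted by a known disparity,
# with unrelated content filling the columns that have no counterpart.
right = rng.standard_normal((C, H, W))
right[:, :, : W - offset] = left[:, :, offset:]

sim = roll_similarity_map(left, right, offset)
print("mean similarity in matched region:", sim[:, offset:].mean())
print("mean similarity in wrapped strip: ", sim[:, :offset].mean())
```

In this toy setup the matched region has similarity close to 1, while the wrapped strip of width `offset` on the left contains unrelated values.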