Skip-SCAR: Hardware-Friendly High-Quality Embodied Visual Navigation

Yaotian Liu; Yu Cao; Jeff Zhang

TL;DR

Skip-SCAR addresses the computational bottlenecks of ObjectNav by integrating an adaptive skip semantic mapping module with a SparseConv-Augmented ResNet (SCAR) for target probability prediction. The adaptive mapping uses lossless and aggressive skips to bypass redundant semantic segmentation and replanning, while SCAR drastically reduces memory and FLOPs relative to dense predictors. Evaluations on HM3D ObjectNav and real hardware show Skip-SCAR achieving state-of-the-art navigation quality with large speedups and memory savings, largely due to the synergy of adaptive skipping and sparse-convolution-based prediction. Overall, the approach demonstrates that jointly optimizing navigation performance and computational efficiency can yield practical, scalable robotic navigation systems.

Abstract

In ObjectNav, agents must locate specific objects within unseen environments, requiring effective perception, prediction, localization, and planning capabilities. This study finds that state-of-the-art embodied AI agents compete for higher navigation quality but often compromise computational efficiency. To address this issue, we introduce "Skip-SCAR," an optimization framework that builds computationally and memory-efficient embodied AI agents to accomplish high-quality visual navigation tasks. Skip-SCAR opportunistically skips redundant step computations during semantic segmentation and local re-planning without hurting navigation quality. Skip-SCAR also adopts a novel hybrid sparse and dense network for object prediction, optimizing both computation and memory footprint. Tested on the HM3D ObjectNav datasets and real-world physical hardware systems, Skip-SCAR not only minimizes hardware resources but also sets new performance benchmarks, demonstrating the benefits of optimizing both navigation quality and computational efficiency for robotics.

Paper Structure (13 sections, 1 equation, 9 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Skip-SCAR Approach
Adaptive Skip Semantic Mapping Design
Lossless skips
Aggressive skips
SparseConv-Augmented ResNet (SCAR) based High-Accuracy Target Probability Predictor
Prediction-Based Goal Selection and Local Planning
Evaluation
Adaptive Skip Semantic Mapping
The Analysis and Performance of SCAR
End-to-End Performance for Skip-SCAR
Conclusion

Figures (9)

Figure 1: Overview of Skip-SCAR. (a) At each step, the agent’s RGB-D and pose observations are used to update the incomplete global semantic map. This map is then used to predict a target object probability map, which is used to select long-term goals. Finally, an analytical local planner calculates the low-level actions necessary to reach the goal. (b) Bar plot of the single-step computation time breakdown for Adaptive Skip vs. PEANUT on a GPU system. Skipping 40.2% of semantic segmentation and 5.5% of local planning results in a corresponding reduction in their computation time. As target prediction and global goal update occur every 10 steps, we divide their respective times by ten. (c) Bar plot of the single-step memory consumption breakdown for SCAR vs. PEANUT on a GPU system. SCAR reduces the memory footprint by 70%.
Figure 2: Schematics of candidate skip scenarios, along with the agent's navigation trajectory.
Figure 3: Approximate revisiting skip under different revisit radii ($r$). SPL and skip ratio are normalized to the no-skip baseline and averaged across 500 episodes in the training set.
Figure 4: SparseConv-Augmented ResNet (SCAR). This is an example architecture, SCAR-18-50. The SparseResNet-50 model follows the ResNet-50 architecture, with modifications where standard downsampling is replaced by convolutions with a kernel size of 3. The strided block indicates a stride size of 2 in the first block of each ResLayer, with no stride in subsequent blocks. Compression layers are $1 \times 1$ convolutions that align the channels of the sparse and dense features. Sparse features are converted to dense format post-compression and fused with dense features. An auxiliary head aids training; only the decode head is active during inference. SCAR takes $\mathbf{m}_t$ and outputs the target object prediction $\mathbf{y}_t \in \mathbb{R}^{C \times H \times W}$.
Figure 5: Visualization of a sample target prediction output (e.g., a toilet) from different model candidates.
...and 4 more figures
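
The per-step accounting described in the Figure 1(b) caption can be illustrated with a short Python sketch. It is not the authors' code. The function name, argument names, and the millisecond timings in the usage lines are hypothetical placeholders. The only values taken from the caption are the 40.2% and 5.5% skip ratios and the 10-step update interval. The sketch also assumes that a skipped step costs nothing for the skipped module, which ignores any overhead of the skip decision itself.

```python
def amortized_step_time(seg_ms, plan_ms, predict_ms, goal_ms,
                        seg_skip=0.0, plan_skip=0.0, update_interval=10):
    """Average per-step time breakdown in milliseconds.

    seg_ms, plan_ms: cost of one semantic segmentation / local planning call.
    predict_ms, goal_ms: cost of one target prediction / global goal update,
        which run once every `update_interval` steps.
    seg_skip, plan_skip: fraction of steps on which the module is skipped.
    Assumption: a skipped call costs zero time.
    """
    if not (0.0 <= seg_skip <= 1.0 and 0.0 <= plan_skip <= 1.0):
        raise ValueError("skip ratios must lie in [0, 1]")
    if update_interval < 1:
        raise ValueError("update_interval must be at least 1")

    breakdown = {
        # Modules that may be skipped: scale by the fraction of steps executed.
        "segmentation": seg_ms * (1.0 - seg_skip),
        "local_planning": plan_ms * (1.0 - plan_skip),
        # Modules that run every `update_interval` steps: amortize per step.
        "target_prediction": predict_ms / update_interval,
        "global_goal_update": goal_ms / update_interval,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Hypothetical timings (not measurements from the paper).
baseline = amortized_step_time(100.0, 20.0, 50.0, 10.0)
with_skip = amortized_step_time(100.0, 20.0, 50.0, 10.0,
                                seg_skip=0.402, plan_skip=0.055)
print(baseline["total"], with_skip["total"])
```

Under this accounting, the skip ratios only affect the segmentation and local planning terms, while the prediction and goal-update terms are reduced by the update interval regardless of skipping.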
