Can Foundation Models Revolutionize Mobile AR Sparse Sensing?

Yiqin Zhao, Tian Guo

TL;DR

This paper tackles the energy–accuracy trade-off in mobile AR by evaluating foundation-model-driven sparse sensing on real-world data. It shows that depth estimated by foundation models markedly improves cross-frame information reuse via geometry-aware image warping: compared with warping based on LiDAR depth, SSIM improves by about $25.5\%$ for RGB images and $30.7\%$ for depth images, and the warping remains robust across larger frame gaps. For long-duration AR, foundation-model depth enables substantially better 3D reconstruction under sparse frame inputs; for example, Hausdorff Distance (lower is better) drops from $0.48$ to $0.25$ with Poisson + ICP, indicating that the approach scales to sparse sensing. The work also shows that information overlap evolves nonlinearly, and it advocates hybrid temporal–spatial sparse-sensing policies that adapt to user and environment context, so that mobile AR senses only when it matters.

Abstract

Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.
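As a rough illustration of the geometry-aware image warping mentioned in the abstract, the sketch below back-projects each source pixel with its depth, moves it into the target camera frame with a rigid transform, and projects it again. It is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a pinhole camera whose intrinsics are shared by both frames, a known relative pose `(R, t)`, pixel centers at integer coordinates, and no occlusion handling. The names `warp_pixel` and `overlap_percentage`, and the definition of overlap as the share of source pixels that land inside the target image, are hypothetical choices made for this example.

```python
from typing import Optional, Sequence, Tuple

Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole model


def warp_pixel(
    u: float,
    v: float,
    depth: float,
    intr: Intrinsics,
    R: Sequence[Sequence[float]],
    t: Sequence[float],
) -> Optional[Tuple[float, float, float]]:
    """Warp one source pixel into the target view.

    R (3x3) and t (3,) map source-camera coordinates to target-camera
    coordinates. Returns (u', v', z') or None if the pixel has no valid
    depth or ends up behind the target camera.
    """
    fx, fy, cx, cy = intr
    if depth <= 0:
        return None
    # Back-project the pixel to a 3D point in the source camera frame.
    p = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform into the target camera frame.
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if q[2] <= 0:
        return None
    # Project the 3D point onto the target image plane.
    return (fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2])


def overlap_percentage(
    depth_map: Sequence[Sequence[float]],
    intr: Intrinsics,
    R: Sequence[Sequence[float]],
    t: Sequence[float],
) -> float:
    """Percentage of source pixels that land inside the target image (assumed definition)."""
    h, w = len(depth_map), len(depth_map[0])
    inside = 0
    for v in range(h):
        for u in range(w):
            hit = warp_pixel(u, v, depth_map[v][u], intr, R, t)
            # Pixel centers sit at integers, so the image spans [-0.5, w - 0.5).
            if hit and -0.5 <= hit[0] < w - 0.5 and -0.5 <= hit[1] < h - 0.5:
                inside += 1
    return 100.0 * inside / (h * w)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    flat_wall = [[2.0] * 8 for _ in range(6)]  # 8x6 depth map, 2 m everywhere
    intr = (100.0, 100.0, 4.0, 3.0)
    # A 2 cm sideways offset shifts every pixel by fx * 0.02 / 2.0 = 1 pixel.
    print(overlap_percentage(flat_wall, intr, identity, [0.02, 0.0, 0.0]))
```

In this toy case one of the eight pixel columns leaves the image, so the script prints 87.5. The quality of `depth` directly controls where each pixel lands, which is why more detailed depth maps can translate into more accurate warped images.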

Paper Structure

This paper contains 7 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: Experiment environment setup. We utilize ScanNet++ [yeshwanth2023scannet++], a state-of-the-art high-quality 3D indoor scene reconstruction dataset, to build our experiment environment. From the dataset, we extract iPhone-based AR session recordings with real-world device mobility and sensor data, as well as laser-scanner-based 3D reconstruction geometries that provide environment sensing ground truth.
  • Figure 2: Qualitative comparison on geometry-aware image warping. We show comparisons on geometry-aware image warping with LiDAR depth, foundation model estimated depth, and ScanNet++ ground truth depth. The time difference between the warping source and the target is 10 frames. We observe that high-quality depth map details estimated by foundation models significantly improve image warping accuracy.
  • Figure 3: Cross-frame information reuse accuracy. For both warped RGB and depth images, image warping based on foundation model–estimated depth consistently yields higher SSIM values. Moreover, the foundation model–based warping demonstrates greater robustness under larger temporal gaps between frames.
  • Figure 4: 3D reconstruction quality measurement. We reconstruct 3D environment meshes with both LiDAR and foundation model-estimated depth and merge multi-view meshes using three different methods. Overall, foundation model-based reconstruction significantly outperforms LiDAR-based methods in terms of Hausdorff Distance$\downarrow$, even under sparse frame inputs.
  • Figure 5: Measurement of frame overlaps. We measure frame overlap by calculating the overlap percentage$\uparrow$ of warped pixels between sparse frames. The frames are selected with two policies: time interval-based (a) and geodetic distance-based (b). We observe nonlinearity in information sparsity across both temporal and spatial domains.
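
Figure 4 reports reconstruction quality as Hausdorff Distance. For readers unfamiliar with the metric, the following is a minimal brute-force sketch of the symmetric Hausdorff distance between two point sets, for example points sampled from a reconstructed mesh and from the ground-truth geometry. The summary does not state which variant the paper uses (one-sided, symmetric, or averaged) or how the meshes are sampled, so treat the symmetric form and the function names here as assumptions.

```python
import math
from typing import Sequence

Point = Sequence[float]


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from any point in `a` to its nearest neighbor in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance: the worse of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
    ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(hausdorff(reconstructed, ground_truth))
```

The brute-force search costs O(|a| · |b|) distance evaluations, which is fine for small examples; dense meshes would normally call for a spatial index such as a k-d tree. Because the metric takes a maximum, a single badly placed surface patch dominates the score, so lower values indicate that even the worst-reconstructed region stays close to the ground truth.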