SSR: A Generic Framework for Text-Aided Map Compression for Localization

Mohammad Omama, Po-han Li, Harsh Goel, Minkyu Choi, Behdad Chalaki, Vaishnav Tadiparthi, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Sandeep P. Chinchali

TL;DR

A text-enhanced compression framework that reduces both memory and bandwidth footprints while retaining high-fidelity localization. It is validated on multiple downstream localization tasks, including Visual Place Recognition and object-centric Monte Carlo localization, in both indoor and outdoor settings.

Abstract

Mapping is crucial in robotics for localization and downstream decision-making. As robots are deployed in ever-broader settings, the maps they rely on continue to increase in size. However, storing these maps indefinitely (cold storage), transferring them across networks, or sending localization queries to cloud-hosted maps imposes prohibitive memory and bandwidth costs. We propose a text-enhanced compression framework that reduces both memory and bandwidth footprints while retaining high-fidelity localization. The key idea is to treat text as an alternative modality: one that can be losslessly compressed with large language models. We propose leveraging lightweight text descriptions combined with very small image feature vectors, which capture "complementary information" as a compact representation for the mapping task. Building on this, our novel technique, Similarity Space Replication (SSR), learns an adaptive image embedding in one shot that captures only the information "complementary" to the text descriptions. We validate our compression framework on multiple downstream localization tasks, including Visual Place Recognition as well as object-centric Monte Carlo localization in both indoor and outdoor settings. SSR achieves 2 times better compression than competing baselines on state-of-the-art datasets, including TokyoVal, Pittsburgh30k, Replica, and KITTI.
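To make the idea of "complementary" information concrete, here is a minimal sketch of retrieval that scores each reference entry by its text similarity plus the similarity of a very small image vector. Everything in it is an assumption for illustration: the additive fusion rule, the `image_weight` parameter, the function names, and the toy vectors are not taken from the paper, and the sketch does not implement SSR training or LLM-based text compression.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, references, image_weight=1.0):
    """Return (index, score) pairs sorted by descending score.

    Each entry is a (text_vec, image_vec) pair. The score is the text
    similarity plus a weighted image similarity; this additive fusion is
    a hypothetical choice made for this sketch.
    """
    q_text, q_img = query
    scored = []
    for idx, (r_text, r_img) in enumerate(references):
        score = cosine(q_text, r_text) + image_weight * cosine(q_img, r_img)
        scored.append((idx, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up toy data: references 0 and 1 have identical caption embeddings,
# so text alone cannot separate them; references 2 and 3 have other captions.
references = [
    ([1.0, 0.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0, 0.0], [1.0, 0.0]),
    ([0.0, 0.0, 1.0], [0.0, 1.0]),
]
query = ([1.0, 0.0, 0.0], [0.1, 0.9])

text_only = rank_candidates(query, references, image_weight=0.0)
fused = rank_candidates(query, references, image_weight=1.0)
```

In this toy setup, text alone rules out references 2 and 3 but leaves references 0 and 1 tied, while the two-dimensional image vector breaks the tie in favor of reference 1.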

Paper Structure (19 sections, 5 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Map compression pipeline and downstream use-cases. The mapping robot (1) first creates a standard map of the environment (2), which is then converted into a feature map (3). Our proposed compression framework (4) processes these feature maps into highly compressed text and complementary feature vectors (5). The compressed map can be stored in cold storage (6), uploaded to a server (7) for remote queries, or transmitted to another robot (8). During inference, if the robot has the compressed map locally (8), standard localization is performed. Otherwise, the input image (10) is compressed by the robot (9), sent to the server, decompressed, and localized, with the resulting pose returned to the robot.
  • Figure 2: Highly compressible text descriptions combined with "complementary" image information are enough for effective localization. Text descriptions (highly compressible) are good enough to discard the bottom two candidates in the reference set for the given query but struggle to distinguish between the top two. Integrating complementary details from the image, like whether the building tapers, ensures a precise match.
  • Figure 3: Pipeline for SSR. (A) SSR (detailed in Sec. \ref{sec:method_ssr}) learns adaptive embeddings (green) from the image embeddings (brown) that capture complementary information to the text embeddings (blue). (B) During inference, SSR projects the image embedding to any desired dimension based on the constraints. In parallel, captions are generated (Sec. \ref{sec:method_vlm}) and then compressed with LLMZip (Sec. \ref{sec:method_llmzip}). The projected complementary feature vector and LLMZipped text are then stored or transmitted. (C) SSR (green) outperforms all competing compression baselines (complete results in Sec. \ref{sec:exp_results}).
  • Figure 4: SSR is highly effective for map compression in Visual Place Recognition (VPR) settings. We show the place recognition performance of all approaches across various compression levels on the TokyoVal and Pittsburgh datasets, using Dino, DinoV2, and ViT embeddings. SSR (green) consistently outperforms all baselines, particularly at smaller memory footprints.
  • Figure 5: SSR generalizes effectively to Monte Carlo localization tasks. We show the localization performance (absolute position error) of all approaches across various compression levels on two Replica rooms and two KITTI sequences. SSR (green) consistently achieves lower localization error than competing baselines.
  • ...and 4 more figures