LiLMaps: Learnable Implicit Language Maps

Evgenii Kruzhkov, Sven Behnke

TL;DR

LiLMaps couples language understanding with 3D environment maps for autonomous robots by learning implicit language representations alongside geometry in an incremental setting. It uses a sparse octree-based implicit map with a compact per-voxel feature and a 3-layer MLP decoder that reconstructs per-point language features. The map features are trained with a vision-language cosine loss $L_{vl}$, while the decoder weights are optimized separately from this loss. To handle previously unseen language features and inconsistent predictions across viewpoints, LiLMaps adds adaptive language decoder optimization and a measurement update strategy that uses a weighted target $\varphi_n^*$ and exponential smoothing with factor $\alpha$ to adapt to noise. Experiments in Habitat with Matterport3D scenes show LiLMaps outperforming baselines such as OpenScene and VLMaps in 3D language mapping quality while running in real time (~4 fps) and supporting 3D language-based object detection. The approach can be integrated with existing implicit SLAM systems with minimal overhead.
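The two update rules named above can be illustrated with a minimal sketch. This summary does not give the exact definitions of $L_{vl}$ or $\varphi_n^*$, so the code assumes the common forms: a loss of one minus cosine similarity, and a smoothing step in which $\alpha$ weights the newest measurement. The function names `cosine_loss` and `smooth_target` are hypothetical and are not taken from the paper.

```python
import math


def cosine_loss(pred, target):
    """Assumed form of a cosine loss: 1 - cos(pred, target)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)


def smooth_target(prev_target, measurement, alpha):
    """Generic exponential smoothing; alpha weights the new measurement.

    This is an assumption about how alpha is applied, not the paper's
    exact definition of the weighted target.
    """
    return [(1.0 - alpha) * p + alpha * m
            for p, m in zip(prev_target, measurement)]


if __name__ == "__main__":
    stored = [1.0, 0.0]       # target feature kept for a map point
    observed = [0.0, 1.0]     # conflicting feature seen from a new viewpoint
    updated = smooth_target(stored, observed, alpha=0.25)
    print(updated)
    print(cosine_loss(updated, stored))
```

With a small $\alpha$, a single conflicting observation only nudges the stored target, which is the kind of noise damping the measurement update aims for.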

Abstract

One of the current trends in robotics is to employ large language models (LLMs) to provide non-predefined command execution and natural human-robot interaction. It is useful to have an environment map together with its language representation, which can be further utilized by LLMs. Such a comprehensive scene representation enables numerous ways of interaction with the map for autonomously operating robots. In this work, we present an approach that enhances incremental implicit mapping through the integration of vision-language features. Specifically, we (i) propose a decoder optimization technique for implicit language maps which can be used when new objects appear on the scene, and (ii) address the problem of inconsistent vision-language predictions between different viewing positions. Our experiments demonstrate the effectiveness of LiLMaps and solid improvements in performance.

Paper Structure (10 sections, 5 equations, 7 figures, 3 tables, 1 algorithm)

This paper contains 10 sections, 5 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Reconstructed implicit language map built with LiLMaps. Semantic colors are assigned based on the similarity of reconstructed language features and CLIP (Radford et al., 2021) encodings of semantic categories from the Matterport3D dataset (Chang et al., 2017).
  • Figure 2: Implicit language mapping. Vision-language features $\varphi$ are extracted from the RGB image. The corresponding points of the depth image are projected to the world coordinate system. Each point can be encoded using its coordinates and the octree: the coordinates are used to find the corresponding octree voxels (blue, red, green); learnable features stored in the voxels' corners are interpolated and summed, producing the point encoding. $F$ vectors are stored only in the voxels of the coarse octree level (blue). The language decoder reconstructs the language feature $\bar{\varphi}$ in the spatial coordinates of the point based on its encoding and the vector $F$. Language loss optimizes the learnable features and $F$ vectors. After optimization, the language map can be reconstructed in arbitrary spatial coordinates. The language decoder is optimized independently of the implicit mapping (see the section on adaptive optimization).
  • Figure 3: Left: Environments reconstructed without measurement update; Middle: Ground Truth; Right: Environments reconstructed with measurement update.
  • Figure 4: Left: Language map produced by OpenScene 3D (Peng et al., 2023); Middle: Ground Truth; Right: Language map created by LiLMaps.
  • Figure 5: Language map incrementally created with our adaptive optimization. Bottom Left: A region mapped in the beginning. Bottom Right: The same region after the mapping is completed. All initially mapped objects remain unchanged.
  • ...and 2 more figures
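
The point encoding described in the Figure 2 caption (corner features interpolated within each voxel, then summed across octree levels) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes trilinear interpolation, and the data layout (a list of `(origin, size, corners)` tuples per level) and the names `trilinear_interpolate` and `encode_point` are hypothetical. The decoder and the $F$ vectors are left out.

```python
from itertools import product


def trilinear_interpolate(corner_features, local):
    """Interpolate 8 corner feature vectors at local coords in [0, 1]^3.

    corner_features maps (i, j, k), with each index in {0, 1}, to a vector.
    """
    x, y, z = local
    dim = len(corner_features[(0, 0, 0)])
    out = [0.0] * dim
    for i, j, k in product((0, 1), repeat=3):
        # Weight of a corner is the product of per-axis linear weights.
        w = ((x if i else 1.0 - x)
             * (y if j else 1.0 - y)
             * (z if k else 1.0 - z))
        for d in range(dim):
            out[d] += w * corner_features[(i, j, k)][d]
    return out


def encode_point(point, levels):
    """Sum the interpolated features of the voxels containing `point`.

    levels: one (voxel_origin, voxel_size, corner_features) tuple per
    octree level (hypothetical layout chosen for this sketch).
    """
    encoding = None
    for origin, size, corners in levels:
        local = tuple((p - o) / size for p, o in zip(point, origin))
        feat = trilinear_interpolate(corners, local)
        if encoding is None:
            encoding = feat
        else:
            encoding = [a + b for a, b in zip(encoding, feat)]
    return encoding


if __name__ == "__main__":
    # Toy corner features: each corner stores its own index as a 3-vector.
    corners = {c: [float(v) for v in c] for c in product((0, 1), repeat=3)}
    levels = [((0.0, 0.0, 0.0), 2.0, corners),   # coarse voxel
              ((0.0, 0.0, 0.0), 1.0, corners)]   # fine voxel
    print(encode_point((0.5, 0.5, 0.5), levels))
```

In a full system the corner features would be learnable parameters, and the summed encoding would be passed to the language decoder together with the coarse-level $F$ vector.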