LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning

Junwoon Lee, Yulun Tian

TL;DR

LatentAM is an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. It relies on an online dictionary learning approach that is both model-agnostic and pretraining-free.

Abstract

We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings using model-specific decoders, LatentAM proposes an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into approximate VLM embeddings using an attention mechanism with a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online to adapt to evolving scene semantics under trust-region regularization. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing, where optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU to maintain bounded GPU memory usage. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity compared to state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets. Our project page is at: https://junwoonlee.github.io/projects/LatentAM
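The abstract describes converting each Gaussian's compact query vector into an approximate VLM embedding through attention over a learnable dictionary. The following is a minimal sketch of that idea, not the authors' implementation: the split of the dictionary into `keys` and `atoms`, the softmax weighting, the `temperature` parameter, and all function names are assumptions made for illustration, since the abstract does not give the exact formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_embedding(query, keys, atoms, temperature=1.0):
    """Approximate a high-dimensional embedding from a compact query.

    query: list of d floats, the compact vector stored on one Gaussian.
    keys:  K lists of d floats (hypothetical: one key per dictionary entry).
    atoms: K lists of D floats (hypothetical: dictionary entries living in
           the VLM embedding space, with D much larger than d in practice).
    """
    # Attention scores: scaled dot product between the query and every key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / temperature for key in keys
    ]
    weights = softmax(scores)
    # The output is a convex combination of the dictionary atoms.
    dim = len(atoms[0])
    return [
        sum(w * atom[j] for w, atom in zip(weights, atoms)) for j in range(dim)
    ]


# Toy example: 2 dictionary entries, query dimension 2, embedding dimension 4.
keys = [[1.0, 0.0], [0.0, 1.0]]
atoms = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
query = [10.0, 0.0]  # strongly aligned with the first key
embedding = decode_embedding(query, keys, atoms)
```

In this sketch only the small query is stored per Gaussian, while the dictionary is shared across the map. That sharing is what would allow swapping the VLM by changing the dictionary rather than retraining a model-specific decoder.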

Paper Structure (16 sections, 10 equations, 5 figures, 4 tables, 1 algorithm)

This paper contains 16 sections, 10 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Large-scale online latent mapping with LatentAM on a custom multi-floor dataset spanning approximately 530 m of trajectory over 14.6 minutes. The system incrementally constructs a latent 3D Gaussian map from streaming RGB-D observations, while supporting plug-and-play integration with different visual–language models including CLIP [radford2021learning], DINOv3 [simeoni2025dinov3], and LSeg [li2022language]. The resulting latent map directly supports downstream open-vocabulary perception tasks such as language-driven querying in 3D.
  • Figure 2: Overview of the proposed online dictionary learning and memory attention pipeline.
  • Figure 3: 3D segmentation results of the proposed method and baselines on Office0 (left), TUM3 (middle), and Floor2 (right) scenes. In the Floor2 scene, Feature-3DGS fails to detect the desk segments (yellow arrow) while our method succeeds (green arrow).
  • Figure 4: Comparison between dictionary learning and distillation [zhou2024feature] on TUM1 and TUM3. The semantic segmentation results are overlaid, with the ground-truth object (book) highlighted.
  • Figure 5: Comparison of segmentation results with and without trust-region regularization. The target word is "monitor". Without regularization, the model misdetects monitors on the wall (yellow arrow), whereas the proposed method suppresses these false positives.