BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding

Chenguang Huang, Shengchao Yan, Wolfram Burgard

TL;DR

BYE tackles long-term dynamic scene understanding by learning a per-scene, class-agnostic encoder from a single exploration sequence and storing the resulting embeddings in an object memory bank. It combines this scene-specific expertise with Vision Language Model features to enable open-vocabulary object association, achieving high accuracy in both simulation and real-world tests. The approach offers a practical path toward lifelong learning in robotics, enabling robust object association under relocation with efficient runtime suitable for real-time systems. By demonstrating effective integration of a lightweight, scene-tailored model with foundation-model signals, BYE provides a scalable strategy for spatio-temporal object tracking and open-vocabulary navigation in changing environments.

Abstract

Dynamic scene understanding remains a persistent challenge in robotic applications. Early dynamic mapping methods focused on mitigating the negative influence of short-term dynamic objects on camera motion estimation by masking or tracking specific categories, which often fall short in adapting to long-term scene changes. Recent efforts address object association in long-term dynamic environments using neural networks trained on synthetic datasets, but they still rely on predefined object shapes and categories. Other methods incorporate visual, geometric, or semantic heuristics for the association but often lack robustness. In this work, we introduce BYE, a class-agnostic, per-scene point cloud encoder that removes the need for predefined categories, shape priors, or extensive association datasets. Trained on only a single sequence of exploration data, BYE can efficiently perform object association in dynamically changing scenes. We further propose an ensembling scheme combining the semantic strengths of Vision Language Models (VLMs) with the scene-specific expertise of BYE, achieving a 7% improvement and a 95% success rate in object association tasks. Code and dataset are available at https://byencoder.github.io.

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: BYE enables reliable and robust object association in long-term dynamic scenes in which objects are relocated. It trains a per-scene encoder on one sequence of exploration data, without the need for large synthetic datasets, category assumptions, or shape priors. The object embedding memory bank stores the latent embeddings of all partial point cloud observations in the reference exploration data, where each point represents a partial observation and each color represents an instance. The contrastive learning process trains the encoder to pull together observations of the same instance while pushing apart observations of different instances.
  • Figure 2: Overview of the BYE pipeline for long-term dynamic environment understanding. From the reference trial of exploration data, we first build an instance-level map using the RGB, depth, instance mask, and odometry inputs, from which we generate a dataset of partial object point cloud observations. Next, we apply the principles of contrastive learning to train a point cloud encoder from scratch. Finally, we encode all the partial observations in the dataset into latent embeddings and associate them with the instance labels of the reference exploration trial to form the object memory bank.
  • Figure 3: The architecture of the point cloud encoder. We follow the architectures of DGCNN [wang2019dgcnn] and PointNet [qi2017pointnet] and adopt the training scheme of SimCLR [chen2020simple], which adds one more MLP layer without normalization after the embedding output layer and projects the representation to a low-dimensional space for more efficient contrastive learning.
  • Figure 4: The process of querying the object memory bank with data from a new exploration trial. Given the RGB, depth, and instance masks in the new exploration trial, we extract the partial point cloud observation, encode it with the pre-trained per-scene point cloud encoder described in Sec. \ref{subsec:train_encoder}, and obtain a latent embedding, which we use to look up the object memory bank (see Sec. \ref{subsec:object_memory_bank_creation}) and find the K nearest neighboring embeddings. After counting the instance labels of the neighboring embeddings, we can associate the partial observation with an instance in the reference exploration trial.
  • Figure 5: The tabletop and furniture setups in the real world. We take one sequence of data in each setup as the reference and the rest as the test data.
  • ...and 1 more figures
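
The memory bank query described in the Figure 4 caption (find the K nearest embeddings, then count their instance labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `associate`, the use of cosine similarity as the distance measure, the 2-D toy embeddings, and the string instance labels are all assumptions made for illustration. The caption does not state which metric or value of K is used.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def associate(query, bank_embeddings, bank_labels, k=3):
    """Return the instance label that wins the vote among the k
    memory-bank embeddings most similar to the query embedding."""
    # Rank every stored embedding by similarity to the query, best first.
    ranked = sorted(
        zip(bank_embeddings, bank_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    # Count the instance labels of the k nearest neighbors.
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical toy memory bank: each entry stands for one partial
# observation from the reference trial, tagged with its instance label.
bank_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.0, 1.0), (0.1, 0.9)]
bank_labels = ["mug", "mug", "mug", "chair", "chair"]

# Embedding of a partial observation from a new trial (made-up values).
result = associate((0.95, 0.05), bank_embeddings, bank_labels, k=3)
print(result)
```

In this toy setup the query lies closest to the three "mug" embeddings, so the vote associates it with that instance. A real memory bank would hold many observations per instance, so a vectorized or approximate nearest-neighbor search would likely replace the full sort used here.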