BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding

Chenguang Huang; Shengchao Yan; Wolfram Burgard

TL;DR

BYE tackles long-term dynamic scene understanding by learning a per-scene, class-agnostic encoder from a single exploration sequence and storing the resulting embeddings in an object memory bank. It combines this scene-specific expertise with Vision Language Model features to enable open-vocabulary object association, achieving high accuracy in both simulation and real-world tests. The approach offers a practical path toward lifelong learning in robotics, enabling robust object association under relocation with efficient runtime suitable for real-time systems. By demonstrating effective integration of a lightweight, scene-tailored model with foundation-model signals, BYE provides a scalable strategy for spatio-temporal object tracking and open-vocabulary navigation in changing environments.

Abstract

Dynamic scene understanding remains a persistent challenge in robotic applications. Early dynamic mapping methods focused on mitigating the negative influence of short-term dynamic objects on camera motion estimation by masking or tracking specific categories, an approach that often falls short in adapting to long-term scene changes. Recent efforts address object association in long-term dynamic environments using neural networks trained on synthetic datasets, but they still rely on predefined object shapes and categories. Other methods incorporate visual, geometric, or semantic heuristics for association but often lack robustness. In this work, we introduce BYE, a class-agnostic, per-scene point cloud encoder that removes the need for predefined categories, shape priors, or extensive association datasets. Trained on only a single sequence of exploration data, BYE can efficiently perform object association in dynamically changing scenes. We further propose an ensembling scheme that combines the semantic strengths of Vision Language Models (VLMs) with the scene-specific expertise of BYE, achieving a 7% improvement and a 95% success rate in object association tasks. Code and dataset are available at https://byencoder.github.io.
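To make the idea of ensembling scene-specific and VLM features concrete, the following is a minimal sketch and not the method from the paper. It assumes each object in the memory bank stores two embeddings (one from a per-scene encoder, one from a VLM) and that association picks the stored object with the highest weighted sum of cosine similarities. The function names, the `weight` parameter, and the weighted-sum rule are all hypothetical and chosen purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(query_scene, query_vlm, memory_bank, weight=0.5):
    """Return (object_id, score) of the best match in the memory bank.

    memory_bank maps object_id -> (scene_embedding, vlm_embedding).
    ASSUMPTION: the two similarities are fused by a weighted sum; the
    actual ensembling scheme in the paper may differ.
    """
    best_id, best_score = None, float("-inf")
    for obj_id, (scene_emb, vlm_emb) in memory_bank.items():
        score = (weight * cosine(query_scene, scene_emb)
                 + (1.0 - weight) * cosine(query_vlm, vlm_emb))
        if score > best_score:
            best_id, best_score = obj_id, score
    return best_id, best_score


# Toy memory bank with two objects and 2-D embeddings (illustrative values only).
bank = {
    "chair": ([1.0, 0.0], [1.0, 0.0]),
    "table": ([0.0, 1.0], [0.0, 1.0]),
}
match_id, match_score = associate([0.9, 0.1], [0.8, 0.2], bank)
print(match_id)
```

In this toy setup the query embeddings lie close to the stored `chair` embeddings in both feature spaces, so the fused score selects `chair`. Setting `weight` to 1.0 or 0.0 reduces the sketch to using only the scene-specific or only the VLM similarity, respectively.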
