EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

Timur Akhtyamov, Mohamad Al Mdfaa, Javier Antonio Ramirez, Sergey Bakulin, German Devchich, Denis Fatykhov, Alexander Mazurov, Kristina Zipa, Malik Mohrat, Pavel Kolesnik, Ivan Sosin, Gonzalo Ferrer

TL;DR

EgoWalk addresses the scarcity of large-scale, real-world navigation data with ground-truth trajectories and semantics by introducing a 50-hour egocentric dataset collected in diverse Moscow environments. It presents end-to-end data processing pipelines and two automatic annotation streams for traversability masks and natural language goals, enabling research across vision-only navigation, vision-language navigation, and semantic understanding. The paper validates EgoWalk through real-robot experiments, model benchmarking, and language annotation evaluation, while transparently discussing limitations such as odometry noise and heuristic annotations. By releasing raw data, trajectories, and auxiliary annotations along with open-source tools, EgoWalk aims to accelerate robust, semantics-aware navigation research with practical real-world impact.

Abstract

Data-driven navigation algorithms depend critically on large-scale, high-quality real-world data for successful training and robust performance in realistic, uncontrolled conditions. To extend the growing family of real-world navigation datasets, we introduce EgoWalk, a dataset of 50 hours of human navigation across diverse indoor and outdoor environments, seasons, and locations. Along with the raw and Imitation Learning-ready data, we introduce several pipelines that automatically create subsidiary datasets for other navigation-related tasks, namely natural language goal annotations and traversability segmentation masks. We provide diversity studies, use cases, and benchmarks for the proposed dataset to demonstrate its practical applicability. We openly release all data processing pipelines and the description of the hardware platform used for data collection to support future research and development in robot navigation systems.

Paper Structure

This paper contains 27 sections, 21 figures, 3 tables.

Figures (21)

  • Figure 1: General overview of the data collection and processing pipelines. Sensor and odometry data are extracted from 50 hours of egocentric recordings and can be used directly for general navigation-related tasks. Automatic annotation pipelines for traversable regions and language goals are introduced to broaden the scope of potential applications.
  • Figure 2: Diversity of the dataset. Location labels were produced using the vision-language model CogVLM2 (Hong et al., 2024).
  • Figure 3: Participant wearing the platform.
  • Figure 4: Overview of our automatic natural language goal annotation pipeline.
  • Figure 5: Examples of the auto-generated traversability masks. Top row: RGB input images. Middle row: traversable masks selected by largest area. Bottom row: traversable masks selected by highest score.
  • ...and 16 more figures
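
The Figure 5 caption names two ways of picking one traversability mask from several candidates: by largest area and by highest score. The sketch below illustrates only those two selection rules. It is not the authors' pipeline: the `Candidate` type, the `score` field, and the toy 3x3 masks are assumptions made for illustration, and the source does not say how candidate masks or their scores are produced.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Binary mask as nested lists; 1 marks a pixel treated as traversable.
Mask = List[List[int]]


@dataclass
class Candidate:
    """Hypothetical container for one candidate mask (not from the paper)."""
    mask: Mask
    score: float  # assumed per-mask confidence from some segmentation model


def mask_area(mask: Mask) -> int:
    """Count the pixels set to 1."""
    return sum(sum(row) for row in mask)


def select_by_largest_area(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the candidate covering the most pixels."""
    return max(candidates, key=lambda c: mask_area(c.mask))


def select_by_highest_score(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the candidate with the highest confidence score."""
    return max(candidates, key=lambda c: c.score)


# Toy data: a wide, lower-confidence mask and a narrow, higher-confidence one.
candidates = [
    Candidate(mask=[[0, 0, 0], [1, 1, 1], [1, 1, 1]], score=0.62),
    Candidate(mask=[[0, 0, 0], [0, 1, 0], [0, 1, 0]], score=0.91),
]

by_area = select_by_largest_area(candidates)
by_score = select_by_highest_score(candidates)

print("largest area :", mask_area(by_area.mask), "px, score", by_area.score)
print("highest score:", mask_area(by_score.mask), "px, score", by_score.score)
```

On this toy input the two rules disagree, which is the situation the middle and bottom rows of Figure 5 contrast: the area rule favors broad coverage, while the score rule favors the candidate the model is most confident about.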