Mapping at First Sense: A Lightweight Neural Network-Based Indoor Structures Prediction Method for Robot Autonomous Exploration

Haojia Gao, Haohua Que, Kunrong Li, Weihao Shan, Mingkai Liu, Rong Zhao, Lei Mu, Xinghua Yang, Qi Wei, Fei Qiao

TL;DR

The paper tackles efficient autonomous exploration in unknown indoor environments by predicting unobserved map regions to guide planning. It introduces SenseMapNet, a lightweight dual-branch architecture that fuses a convolutional encoder with a Transformer encoder to predict occluded regions from the local observation map. SenseMapDataset, built from KTH and HouseExpo, supports training and evaluation, including comparisons with frontier-based exploration. SenseMapNet reaches an SSIM of 0.78, an LPIPS of 0.68, and an FID of 239.79 for map reconstruction, cuts exploration time by 46.5% (from 2335.56 s to 1248.68 s), and attains 88% coverage and 88% reconstruction accuracy.

Abstract

Autonomous exploration in unknown environments is a critical challenge in robotics, particularly for applications such as indoor navigation, search and rescue, and service robotics. Traditional exploration strategies, such as frontier-based methods, often struggle to efficiently utilize prior knowledge of structural regularities in indoor spaces. To address this limitation, we propose Mapping at First Sense, a lightweight neural network-based approach that predicts unobserved areas in local maps, thereby enhancing exploration efficiency. The core of our method, SenseMapNet, integrates convolutional and transformer-based architectures to infer occluded regions while maintaining computational efficiency for real-time deployment on resource-constrained robots. Additionally, we introduce SenseMapDataset, a curated dataset constructed from KTH and HouseExpo environments, which facilitates training and evaluation of neural models for indoor exploration. Experimental results demonstrate that SenseMapNet achieves an SSIM (structural similarity) of 0.78, LPIPS (perceptual quality) of 0.68, and an FID (feature distribution alignment) of 239.79, outperforming conventional methods in map reconstruction quality. Compared to traditional frontier-based exploration, our method reduces exploration time by 46.5% (from 2335.56 s to 1248.68 s) while maintaining a high coverage rate (88%) and achieving a reconstruction accuracy of 88%. The proposed method represents a promising step toward efficient, learning-driven robotic exploration in structured environments.
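
As a quick consistency check on the figures quoted in the abstract, the relative reduction follows directly from the two reported exploration times:

$$
\frac{2335.56 - 1248.68}{2335.56} = \frac{1086.88}{2335.56} \approx 0.465
$$

This is about 46.5%, which matches the stated reduction.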

Paper Structure

This paper contains 16 sections, 12 equations, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: Overview of the SenseMap Pipeline for Autonomous Exploration. This pipeline consists of three main stages: Observation, Prediction, and Planning. At time step $t$, the robot captures an Observation Map, extracting a Local Observation Map (highlighted in red) from the global representation. The local observation is then processed by the proposed SenseMapNet, generating a Local Prediction Map, which is further integrated into the Global Prediction Map. The updated map is used to identify unexplored frontiers via Frontier Clustering, followed by path planning using the A* algorithm. This pipeline enhances exploration efficiency by reducing uncertainty and improving map completeness.
  • Figure 2: Architecture of SenseMapNet for Local Map Prediction. The input local observation map undergoes two parallel processing streams: a convolutional encoder-decoder network and a Transformer-based encoding pipeline. The convolutional encoder extracts hierarchical spatial features, progressively reducing the resolution while increasing feature depth. Simultaneously, the input map is divided into non-overlapping patches, which are flattened and projected into an embedding space before being processed by the Transformer Encoder. The Transformer module captures long-range dependencies and global spatial relationships. The extracted multiscale features from both streams are fused through skip connections and multi-resolution aggregation, followed by a decoding process to reconstruct the local prediction map. This dual-branch structure enables the model to leverage both fine-grained local spatial details and high-level contextual information, improving prediction accuracy and robustness in autonomous navigation tasks.
  • Figure 3: Example of dataset samples. The left image represents the ground truth label map, where white pixels indicate free space and black pixels denote obstacles. The right image illustrates the corresponding observation map, obtained from the robot's sensor data. The observation map consists of three color-coded channels: blue for free space, green for uncertain regions, and black for unexplored areas. This dataset provides rich spatial information, enabling the model to learn and predict local map structures effectively.
  • Figure 4: Qualitative comparison of different loss functions and models for local map prediction. The first two columns display the ground truth label map and the corresponding observation map. The remaining columns show the predicted maps generated by different models using either mean squared error (MSE) loss or a hybrid loss (MIX), which combines perceptual and MSE losses. As observed, models trained solely with MSE loss tend to be overly conservative due to the class imbalance between free space and obstacles in the training data. The hybrid loss helps mitigate this issue, improving structure preservation and enhancing the quality of the predicted maps.
  • Figure 5: Qualitative comparison of different models on local map prediction. The first two columns represent the ground truth map and the corresponding observation map. The remaining columns display the predicted maps generated by various models, including SenseMapNet, SenseMapNetLarge, UNet, and LaMa-Fourier. These visual results demonstrate the effectiveness of different architectures in predicting local maps. Notably, SenseMapNet and its larger variant show superior structure preservation and spatial consistency compared to UNet and LaMa-Fourier. A detailed quantitative comparison of these models, including the number of parameters, SSIM, LPIPS, and FID, is presented in Tab. \ref{tab5}.
  • ...and 2 more figures
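
The Figure 1 caption mentions identifying unexplored frontiers and grouping them by Frontier Clustering before A* planning. The following is a minimal sketch of that general idea on a small occupancy grid, not the paper's implementation. The cell labels (`FREE`, `OCCUPIED`, `UNKNOWN`), the 4-neighbour frontier test, and the 8-connected clustering rule are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical cell labels, chosen for this sketch only.
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontiers(grid):
    """Return free cells that touch at least one unknown cell (4-neighbourhood)."""
    rows, cols = len(grid), len(grid[0])
    frontiers = set()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.add((r, c))
                    break
    return frontiers


def cluster_frontiers(frontiers):
    """Group frontier cells into 8-connected clusters with breadth-first search."""
    remaining = set(frontiers)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, queue = [seed], deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbour = (r + dr, c + dc)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        cluster.append(neighbour)
                        queue.append(neighbour)
        clusters.append(sorted(cluster))
    return sorted(clusters)


U = UNKNOWN
grid = [
    [0, 0, U, U, U],
    [0, 1, 1, 1, U],
    [0, 0, 0, 1, U],
    [U, U, 0, 0, 0],
]
clusters = cluster_frontiers(find_frontiers(grid))
for cluster in clusters:
    print(cluster)
```

In a pipeline like the one described, a planner would then pick a target from one of these clusters and run A* toward it. How the target is selected is not stated in this section, so it is left out here.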
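
The Figure 2 caption states that the input map is divided into non-overlapping patches, which are flattened before being projected into an embedding space. The sketch below shows only the split-and-flatten step in plain Python. The function name `split_into_patches`, the 4x4 map, and the patch size of 2 are hypothetical, and the learned linear projection and Transformer Encoder are omitted.

```python
def split_into_patches(grid, patch):
    """Split a 2D grid into non-overlapping patch x patch tiles.

    Each tile is flattened in row-major order, giving one token per tile.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % patch or cols % patch:
        raise ValueError("grid size must be divisible by the patch size")
    tokens = []
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            tokens.append(
                [grid[r][c]
                 for r in range(r0, r0 + patch)
                 for c in range(c0, c0 + patch)]
            )
    return tokens


# Toy 4x4 "local map" whose cell values encode their own position.
local_map = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = split_into_patches(local_map, patch=2)
for token in tokens:
    print(token)
```

A 4x4 map with patch size 2 yields four tokens of length four. In a Transformer-based branch, each such token would then be mapped to an embedding vector.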