An evaluation of CNN models and data augmentation techniques in hierarchical localization of mobile robots

J. J. Cabrera, O. J. Céspedes, S. Cebollada, O. Reinoso, L. Payá

TL;DR

This paper tackles robust indoor hierarchical localization for mobile robots using omnidirectional imagery. It evaluates multiple CNN backbones and a structured data augmentation strategy to enable a coarse room retrieval step followed by fine-grained image-to-map matching within the predicted room. The study provides detailed ablations across different illumination conditions, with ConvNeXt Large often delivering the best overall accuracy and real-time capability, and identifies how specific augmentation types influence performance, especially under challenging lighting. The findings inform algorithm and architecture choices for practical visual localization in dynamic indoor environments, and the code is publicly available for reproducibility.

Abstract

This work presents an evaluation of CNN models and data augmentation to carry out the hierarchical localization of a mobile robot by using omnidirectional images. To this end, an ablation study of different state-of-the-art CNN models used as backbone is presented and a variety of data augmentation visual effects are proposed for addressing the visual localization of the robot. The proposed method is based on the adaptation and re-training of a CNN with a dual purpose: (1) to perform a rough localization step in which the model is used to predict the room from which an image was captured, and (2) to address the fine localization step, which consists of retrieving the most similar image of the visual map among those contained in the previously predicted room by means of a pairwise comparison between descriptors obtained from an intermediate layer of the CNN. Accordingly, we evaluate the impact of different state-of-the-art CNN models such as ConvNeXt for addressing the proposed localization. Finally, a variety of data augmentation visual effects are separately employed for training the model and their impact is assessed. The performance of the resulting CNNs is evaluated under real operation conditions, including changes in the lighting conditions. Our code is publicly available on the project website https://github.com/juanjo-cabrera/IndoorLocalizationSingleCNN.git
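
The two-step procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hierarchical_localize`, the layout of `visual_map`, the toy descriptors and the use of Euclidean distance for the pairwise comparison are all assumptions made for illustration. In the actual method, the room scores and the descriptor would come from the re-trained CNN.

```python
import math

def hierarchical_localize(room_scores, test_descriptor, visual_map):
    """Coarse-to-fine localization sketch (hypothetical helper).

    room_scores: per-room scores, e.g. from a classification head (assumed).
    test_descriptor: global descriptor of the test image (assumed given).
    visual_map: dict mapping room index -> list of (descriptor, (x, y)).
    """
    # Rough step: keep the room with the highest score.
    room = max(range(len(room_scores)), key=lambda i: room_scores[i])

    # Fine step: nearest neighbour among the descriptors of that room only.
    # Euclidean distance is an assumption; the source only says "pairwise comparison".
    _, position = min(
        visual_map[room],
        key=lambda entry: math.dist(entry[0], test_descriptor),
    )
    return room, position

# Toy visual map with two rooms and two map images per room (made-up values).
visual_map = {
    0: [([0.0, 0.0], (0.0, 0.0)), ([1.0, 0.0], (1.0, 0.0))],
    1: [([0.0, 1.0], (5.0, 2.0)), ([1.0, 1.0], (6.0, 2.0))],
}
room, position = hierarchical_localize([0.2, 0.8], [0.9, 1.1], visual_map)
print(room, position)
```

Restricting the nearest-neighbour search to the predicted room reduces the number of comparisons, at the cost of making the fine step depend on the rough step being correct.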

Paper Structure

This paper contains 17 sections, 6 equations, 4 figures, and 8 tables.

Figures (4)

  • Figure 1: Diagram of the proposed hierarchical localization. The test image $im_{test}$ is the input of the CNN, which predicts the most likely room $c_i$ and embeds the image into a global descriptor $\vec{d}_{test}$ by flattening the last activation map. This descriptor is compared with the descriptors from the training dataset included in the retrieved room by means of a nearest neighbour search. Consequently, the capture point of the image that corresponds to the most similar descriptor ($im_{c_i,k}$) is considered an estimation of the position where $im_{test}$ was captured.
  • Figure 2: Example of data augmentation where only one effect is applied to each image. (a) Original image, (b) spotlight effect, (c) shadow effect, (d) general brightness, (e) general darkness, (f) contrast, (g) saturation and (h) rotation. The images contained in this dataset can be downloaded from the website https://www.cas.kth.se/COLD/.
  • Figure 3: Hierarchical localization errors in meters for different CNN architectures. The box plots represent the distribution of errors, with whiskers indicating variability. The Mean Absolute Error for each model and condition is marked by a black dot and annotated with the specific error value. Results are obtained under different lighting conditions: cloudy (red), night (orange), sunny (yellow) and considering jointly the three conditions (green).
  • Figure 4: Hierarchical localization errors in meters when training the ConvNeXt Large architecture with different data augmentation effects. The box plots represent the distribution of errors, with whiskers indicating variability. The Mean Absolute Error for each model and condition is marked by a black dot and annotated with the specific error value. Results are obtained under different lighting conditions: cloudy (red), night (orange), sunny (yellow) and considering jointly the three conditions (green).
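
The single-effect augmentation described in the caption of Figure 2 could be sketched as follows for three of the listed effects (general brightness, general darkness and contrast). This is only an illustrative sketch: the grayscale nested-list image format, the helper names and the factors 1.5 and 0.5 are assumptions, not the parameters used in the paper.

```python
import random

def clamp(value):
    """Round and clip a pixel value to the 0..255 range."""
    return max(0, min(255, int(round(value))))

def adjust_brightness(img, factor):
    """Scale every pixel; factor > 1 brightens, factor < 1 darkens."""
    return [[clamp(p * factor) for p in row] for row in img]

def adjust_contrast(img, factor):
    """Stretch pixel values around the image mean."""
    mean = sum(sum(row) for row in img) / sum(len(row) for row in img)
    return [[clamp(mean + (p - mean) * factor) for p in row] for row in img]

# Hypothetical effect table; the factors are assumed values.
EFFECTS = {
    "brightness": lambda img: adjust_brightness(img, 1.5),
    "darkness": lambda img: adjust_brightness(img, 0.5),
    "contrast": lambda img: adjust_contrast(img, 1.5),
}

def augment_once(img, rng):
    """Apply exactly one randomly chosen effect to the image."""
    name = rng.choice(sorted(EFFECTS))
    return name, EFFECTS[name](img)

image = [[100, 200], [100, 200]]  # tiny grayscale toy image
effect_name, augmented = augment_once(image, random.Random(0))
print(effect_name, augmented)
```

Applying one effect per image, rather than stacking several, makes it possible to train one model per effect and attribute any change in localization error to that effect alone.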