Hierarchical place recognition with omnidirectional images and curriculum learning-based loss functions
Marcos Alfaro, Juan José Cabrera, María Flores, Óscar Reinoso, Luis Payá
TL;DR
This work tackles Visual Place Recognition (VPR) under challenging, real-world conditions by combining omnidirectional panoramic imagery with a hierarchical coarse-to-fine localization pipeline. It introduces curriculum-learning-based triplet losses that progressively increase training difficulty, yielding more discriminative embeddings for both room-level retrieval and intra-room positioning. Across indoor and outdoor datasets, the proposed losses outperform standard triplet losses, remain robust to illumination changes, noise, occlusions, and motion blur, and generalize well with limited training data. The approach offers a practical, efficient solution for real-world robotic localization, and the code is publicly available to facilitate adoption and further research.
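As a rough illustration of a triplet loss whose difficulty grows during training, the sketch below uses a margin that increases linearly with the epoch. This summary does not give the paper's actual loss formulations, so the linear schedule, the function names (`curriculum_margin`, `triplet_loss`), and the default margin values are assumptions made purely for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def curriculum_margin(epoch, num_epochs, m_start=0.1, m_end=1.0):
    """Hypothetical schedule: grow the margin linearly from m_start to m_end.

    A small margin early on makes most triplets easy to satisfy; a larger
    margin later forces a wider gap between positives and negatives.
    """
    if num_epochs <= 1:
        return m_end
    t = min(max(epoch / (num_epochs - 1), 0.0), 1.0)
    return m_start + t * (m_end - m_start)


def triplet_loss(anchor, positive, negative, margin):
    """Standard hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy usage: the same triplet is "solved" early but penalized late in training.
anchor, positive, negative = [0.0, 0.0], [0.3, 0.4], [0.6, 0.8]
early = triplet_loss(anchor, positive, negative, curriculum_margin(0, 10))
late = triplet_loss(anchor, positive, negative, curriculum_margin(9, 10))
```

With this toy triplet, the loss is zero under the initial margin and positive under the final one, which is the sense in which the schedule makes training progressively harder.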
Abstract
This paper addresses Visual Place Recognition (VPR), which is essential for the safe navigation of mobile robots. The proposed solution employs panoramic images and deep learning models fine-tuned with triplet loss functions that integrate curriculum learning strategies. By progressively presenting more challenging examples during training, these loss functions enable the model to learn more discriminative and robust feature representations, overcoming the limitations of conventional contrastive loss functions. After training, VPR is tackled in two steps: coarse (room retrieval) and fine (position estimation). The results demonstrate that the curriculum-based triplet losses consistently outperform standard contrastive loss functions, particularly under challenging perceptual conditions. To thoroughly assess its robustness and generalization capabilities, the proposed method is evaluated in a variety of indoor and outdoor environments. The approach is tested against common challenges of real operating conditions, including severe illumination changes, dynamic visual effects such as noise and occlusions, and scenarios with limited training data. The results show that the proposed framework performs competitively in all these situations, achieving high recognition accuracy and demonstrating its potential as a reliable solution for real-world robotic applications. The code used in the experiments is available at https://github.com/MarcosAlfaro/TripletNetworksIndoorLocalization.git.
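The two-step scheme described above (room retrieval, then position estimation) can be sketched as a pair of nearest-neighbor searches over image descriptors. This is a minimal sketch, not the authors' implementation: it assumes each room is summarized by a single representative descriptor and that the same descriptor is used for both steps, neither of which is stated in the abstract. The names `localize`, `room_reps`, and `room_maps` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def localize(query, room_reps, room_maps):
    """Coarse-to-fine localization by nearest-neighbor search.

    room_reps: {room_name: representative_descriptor}   (assumed structure)
    room_maps: {room_name: [(descriptor, (x, y)), ...]}  (assumed structure)
    Returns the retrieved room and the position of the closest map image.
    """
    # Coarse step: pick the room whose representative is closest to the query.
    room = min(room_reps, key=lambda r: euclidean(query, room_reps[r]))
    # Fine step: search only among the images captured in that room.
    _, position = min(room_maps[room], key=lambda e: euclidean(query, e[0]))
    return room, position


# Toy map with two rooms and 2-D descriptors (real descriptors are much larger).
room_reps = {"kitchen": [0.0, 0.0], "office": [10.0, 10.0]}
room_maps = {
    "kitchen": [([0.0, 1.0], (0.5, 0.5)), ([1.0, 0.0], (1.5, 0.5))],
    "office": [([9.0, 10.0], (6.0, 2.0)), ([11.0, 10.0], (7.0, 2.0))],
}
result = localize([10.8, 9.9], room_reps, room_maps)
```

Restricting the fine search to the retrieved room keeps the second step cheap, at the cost that a wrong room in the coarse step cannot be recovered in the fine step.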
