Collaborative Visual Place Recognition through Federated Learning

Mattia Dutto; Gabriele Berton; Debora Caldarola; Eros Fanì; Gabriele Trivigno; Carlo Masone

TL;DR

This paper addresses the training of Visual Place Recognition (VPR) models in a privacy-preserving, distributed setting. It introduces FedVPR, a federated learning (FL) framework in which geographically distributed clients mine and train on their own private data while a central server aggregates their updates via FedAvg. The paper formalizes VPR as an FL problem, analyzes data and system heterogeneity, and proposes three federated splits of the MSLS dataset (Proximity, Clustering, Random) to mimic real-world deployments. Experiments show that FedVPR can approach centralized performance with careful design choices, and they highlight how local data quantity, augmentations, and mining distributions affect learning. The work provides a practical foundation for privacy-preserving VPR and opens avenues for applying federated learning to other image retrieval tasks on edge devices.

Abstract

Visual Place Recognition (VPR) aims to estimate the location of an image by treating it as a retrieval problem. VPR uses a database of geo-tagged images and leverages deep neural networks to extract a global representation, called a descriptor, from each image. While the training data for VPR models often originates from diverse, geographically scattered sources (geo-tagged images), the training process itself is typically assumed to be centralized. This research revisits the task of VPR through the lens of Federated Learning (FL), addressing several key challenges associated with this adaptation. VPR data inherently lacks well-defined classes, and models are typically trained using contrastive learning, which requires a data mining step on a centralized database. Additionally, client devices in federated systems can be highly heterogeneous in terms of their processing capabilities. The proposed FedVPR framework not only presents a novel approach for VPR but also introduces a new, challenging, and realistic task for FL research, paving the way for other image retrieval tasks in FL.
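
As a rough illustration of the retrieval formulation described in the abstract, the sketch below localizes a query by returning the geo-tag of the database image whose descriptor is closest. It is a minimal sketch, not the paper's implementation: the function name `localize`, the 3-dimensional toy descriptors, the coordinates, and the choice of Euclidean distance are all assumptions made for the example. In practice the descriptors would come from a deep neural network.

```python
import math


def localize(query_descriptor, database):
    """Return the geo-tag of the database entry nearest to the query.

    `database` is a list of (descriptor, (lat, lon)) pairs. Euclidean
    distance between descriptors is an assumption made for this sketch.
    """
    nearest = min(database, key=lambda entry: math.dist(query_descriptor, entry[0]))
    return nearest[1]


# Toy database: hand-written descriptors standing in for network outputs.
database = [
    ([0.9, 0.1, 0.0], (45.0703, 7.6869)),
    ([0.0, 1.0, 0.2], (48.8566, 2.3522)),
    ([0.1, 0.0, 1.0], (52.5200, 13.4050)),
]

query = [0.8, 0.2, 0.1]
predicted = localize(query, database)
print(predicted)
```

The query is assigned the location of its nearest neighbour in descriptor space, which is why the quality of the learned descriptor determines localization accuracy.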

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 10 tables.

Introduction
Related work
Method
Centralized VPR
Federated Visual Place Recognition (FedVPR)
Decentralizing the MSLS dataset for FL
Proposed FL datasets
Experiments and Results
Implementation details
Centralized baselines
FL baselines
Data Quantity Skewness in FedVPR
Heterogeneity of Local Augmentations
Impact of Data Distribution on Local Mining
Conclusion
...and 4 more sections

Figures (5)

Figure 1: Federated Visual Place Recognition (FedVPR): we revisit the training of Visual Place Recognition models from the perspective of Federated Learning, with clients distributed across geographical areas, each possessing heterogeneous computational and communication resources and availability. Instead of relying on a central database for mining, each client builds its own database of geo-tagged images and uses it for local training based on contrastive learning (step a.). Subsequently, it communicates its model weights to the server, where they are aggregated into a new global model (step b.).
Figure 2: FedVPR training. At each round $t$, the server sends the current global model to a set of active clients, e.g., client 1 and client 2 in the figure. Each client $i$ has access to its own local dataset $\mathcal{D}_i$, whose distribution is strongly influenced by the user's geographical position (hence the country flags on the local datasets). Unlike centralized VPR, in FedVPR mining is performed on the client's previously collected images. Thus, given a query image, local optimization is based on a contrastive loss, which relies on positive and negative images extracted from $\mathcal{D}_i$. Since each local dataset follows a different distribution, the resulting updated parameters vary from client to client (orange vs. purple updates). Lastly, the local parameters are sent back to the server, where they are aggregated with FedAvg.
Figure 3: Centralized setting. Comparison of R$@1$ (%) and computational time (hours) when varying the image resolution. Resolution greatly affects training time, and an optimal trade-off can be attained with minimal performance drops.
Figure 4: Client distribution in the MSLS Proximity split.
Figure 5: Image distribution in the MSLS Proximity split.
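
The round structure in the Figure 2 caption (local mining on each client's own images, then server-side aggregation with FedAvg) can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code. `fedavg` uses the standard FedAvg weighting by local dataset size. `mine_triplet`, its 25-metre positive radius, the hardest-negative rule, and the planar `pos` coordinates are hypothetical choices for the example, since the caption only says that positives and negatives are drawn from $\mathcal{D}_i$.

```python
import math


def fedavg(client_params, client_sizes):
    """Average flat parameter lists, weighting each client by its data size."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[k] * size for params, size in zip(client_params, client_sizes)) / total
        for k in range(n_params)
    ]


def mine_triplet(query, local_db, pos_radius=25.0):
    """Pick a positive and a negative for `query` from the client's own data.

    Items are dicts with 'name', 'pos' (x, y in metres) and 'desc'.
    Assumed rule: the positive is the geographically close image that is
    nearest in descriptor space; the negative is the geographically far
    image that is nearest in descriptor space (a hard negative).
    """
    near = [d for d in local_db if math.dist(query["pos"], d["pos"]) <= pos_radius]
    far = [d for d in local_db if math.dist(query["pos"], d["pos"]) > pos_radius]
    if not near or not far:
        return None  # this client cannot form a triplet for the query

    def desc_distance(item):
        return math.dist(query["desc"], item["desc"])

    return min(near, key=desc_distance), min(far, key=desc_distance)


# Server side: two clients return updated parameters after local training.
global_params = fedavg([[1.0, 2.0], [3.0, 6.0]], client_sizes=[1, 3])

# Client side: mining uses only images stored on this client.
local_db = [
    {"name": "A", "pos": (10.0, 0.0), "desc": [0.9, 0.1]},
    {"name": "B", "pos": (500.0, 0.0), "desc": [0.8, 0.1]},
    {"name": "C", "pos": (900.0, 0.0), "desc": [0.0, 1.0]},
]
query = {"name": "Q", "pos": (0.0, 0.0), "desc": [1.0, 0.0]}
positive, negative = mine_triplet(query, local_db)
print(global_params, positive["name"], negative["name"])
```

Because mining never leaves the client, the triplets each client can form depend on its local data distribution, which is one way the client-to-client differences in updates mentioned in the caption can arise.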
