Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization

David Faget; José Luis Lisani; Miguel Colom

Abstract

Planet-scale photo geolocalization is the task of estimating the geographic location depicted in an image based purely on its visual features. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains challenging. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer, as is typically done. This approach provides a more detailed understanding of how different image features contribute to the model's decisions than traditional single-layer approaches.

Paper Structure (6 sections, 3 equations, 7 figures)

INTRODUCTION
GRAD-CAM: AN INTERPRETABILITY TOOL
INTERPRETABILITY IN GEOLOCALIZATION NETWORKS
COMBI-CAM
EXPERIMENTS AND DISCUSSION
CONCLUSION

Figures (7)

Figure 1: The Eiffel Tower and its replicas: Original in Paris (left), and replicas in Las Vegas (center) and Madrid (right).
Figure 2: Aerial image of Paris (France) showing a wide outlook of the city. It includes characteristic objects such as the Eiffel Tower, the Seine river, and buildings with particular architectural elements.
Figure 3: Results of applying Grad-CAM to the last layer of selected blocks (from #0 to #31) of the EfficientNet-B4 architecture. Characteristic elements, such as the Eiffel Tower, are more prominent in the middle blocks than in the final blocks.
Figure 4: The maximum activation magnitude per block indicates that the highest activations occur in blocks #22 to #29, where the pixels of characteristic elements activate the most. More specifically, the most significant block is #22, which prominently highlights the most recognizable element of Paris: the Eiffel Tower.
Figure 5: View of Sydney (Australia).
...and 2 more figures
