CityGuessr: City-Level Video Geo-Localization on a Global Scale

Parth Parag Kulkarni, Gaurav Kumar Nayak, Mubarak Shah

TL;DR

This work proposes a novel problem, worldwide video geolocalization, with the objective of hierarchically predicting the correct city, state/province, country, and continent for a given video. It also introduces a new dataset, CityGuessr68k, comprising 68,269 videos from 166 cities around the world.

Abstract

Video geolocalization is a crucial problem in current times. Given just a video, ascertaining where it was captured can have a plethora of advantages. The problem of worldwide geolocalization has been tackled before, but only in the image modality; its video counterpart remains relatively unexplored. Video geolocalization has also garnered some attention in the recent past, but the existing methods are all restricted to specific regions. This motivates us to explore the problem of video geolocalization at a global scale. Hence, we propose a novel problem of worldwide video geolocalization with the objective of hierarchically predicting the correct city, state/province, country, and continent, given a video. However, no large-scale video datasets with extensive worldwide coverage exist for training models to solve this problem. To this end, we introduce a new dataset, CityGuessr68k, comprising 68,269 videos from 166 cities all over the world. We also propose a novel baseline approach to this problem: a transformer-based architecture comprising a Self-Cross Attention module for incorporating scenes and a TextLabel Alignment strategy for distilling knowledge from text labels in feature space. To further enhance our location prediction, we also utilize soft-scene labels. Finally, we demonstrate the performance of our method on our new dataset as well as on Mapillary (MSLS). Our code and datasets are available at: https://github.com/ParthPK/CityGuessr

Paper Structure

This paper contains 39 sections, 7 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Sample frames of videos from 22 different countries in the CityGuessr68k dataset. Each quartet represents a continent. The continents, in order, are Asia, Africa, Europe, North America, South America, and Oceania.
  • Figure 2: Data distribution. A comparison of CityGuessr68k with the Mapillary (MSLS) dataset. CityGuessr68k covers more regions of the world and has a uniform spread around the globe.
  • Figure 3: Class distribution. Bar chart of the number of samples per city class (please zoom in for clearer class labels).
  • Figure 4: Frequency distribution. Histograms for each hierarchy, giving further statistical insight into the data distribution of CityGuessr68k.
  • Figure 5: Schematic illustration of the proposed model architecture. The VideoMAE encoder outputs feature embeddings of the input video. The embeddings are then passed into 4 classifiers, one for each of the 4 hierarchies. Their predictions are used for computing the Geolocalization loss. Simultaneously, the prediction vectors are input into the Self-Cross Attention module, where the vectors of all 4 hierarchies are concatenated and attended to, by themselves and by each other, to generate an intermediate attended vector ($PV'$). In the attention weights ($w$), the single-colored weights along the diagonal are self-attention weights, while the gradient double-colored weights are cross-attention weights between the vectors of the two corresponding hierarchies. $PV'$ is passed simultaneously through $FFN_s$, to generate the vector $PV'_s$ for Scene loss computation, and to the TextLabel Alignment module. There, it is passed through $FFN_t$ to generate the vector $PV'_t$. $PV'_t$ is used for TextLabel Alignment with the feature embeddings $F_t$ generated by the pretrained text encoder from the label names of all 4 hierarchies.
  • ...and 11 more figures
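
The attention pattern described in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes each hierarchy's prediction vector has already been projected to a common length (the caption does not say how the vectors are made comparable), it omits learned query/key/value projections, and the function name `self_cross_attention` and the toy numbers are hypothetical. It only shows how one weight matrix can hold self-attention weights on the diagonal and cross-attention weights between hierarchies off the diagonal.

```python
import math

# Hierarchy names taken from the paper; everything else here is illustrative.
HIERARCHIES = ["city", "state", "country", "continent"]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_cross_attention(tokens):
    """Scaled dot-product attention over one token per hierarchy.

    ASSUMPTION: tokens are equal-length vectors (one per hierarchy) and are
    used directly as queries, keys, and values (no learned projections).
    Returns (weights, attended), where weights[i][j] is how strongly
    hierarchy i attends to hierarchy j.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    attended = [
        [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        for w in weights
    ]
    return weights, attended


# Toy prediction vectors, made up for illustration only.
tokens = [
    [0.9, 0.1, 0.0, 0.2],  # city
    [0.8, 0.2, 0.1, 0.1],  # state
    [0.1, 0.9, 0.3, 0.0],  # country
    [0.0, 0.7, 0.5, 0.1],  # continent
]

weights, attended = self_cross_attention(tokens)
for name, row in zip(HIERARCHIES, weights):
    # Diagonal entry = self-attention; other entries = cross-attention.
    print(name, [round(w, 3) for w in row])
```

In this sketch, `weights[i][i]` plays the role of the single-colored diagonal entries in the caption, and `weights[i][j]` with `i != j` plays the role of the cross-attention entries between two hierarchies. The rows of `attended` stand in for the intermediate attended vector $PV'$ before it is passed to $FFN_s$ and $FFN_t$.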