Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition

Kyle Buettner, Sina Malakouti, Xiang Lorraine Li, Adriana Kovashka

TL;DR

The paper addresses geographical domain shift in object recognition by injecting geo-diverse descriptive knowledge into prompts for CLIP-based models. It combines geographic knowledge internal to CLIP (CountryInPrompt), external descriptive knowledge obtained from an LLM (CountryLLM), and a geography knowledge regularization term so that class representations generalize across geographies. On DollarStreet, with training on Europe data only, accuracy gains over prompting baselines reach up to +2.8/1.2/1.6 on Africa/Asia/Americas. Results are also reported on GeoNet, and the approach is competitive with, and in some settings better than, few-shot training on target data. The work highlights the importance of aligning vision-language representations with diverse regional knowledge as a route to fairer and more robust real-world deployments.
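
As a rough illustration of the two knowledge sources named above, the sketch below builds a country-conditioned prompt and a set of descriptor-augmented prompts for one class. The template strings, the function names, the class "stove", and the placeholder descriptors are all assumptions made for illustration; this summary does not give the paper's exact prompt wording.

```python
from typing import List


def country_in_prompt(class_name: str, country: str) -> str:
    # Hypothetical template: name the country directly in the text prompt,
    # relying on whatever geographic knowledge CLIP already holds.
    return f"a photo of a {class_name} in {country}"


def country_llm_prompts(class_name: str, country: str,
                        descriptors: List[str]) -> List[str]:
    # Hypothetical template: one prompt per descriptor, where the descriptors
    # would come from asking an LLM how the object looks in that country.
    base = country_in_prompt(class_name, country)
    return [f"{base}, which has {d}" for d in descriptors]


if __name__ == "__main__":
    # Placeholder descriptors; real ones would be produced by an LLM.
    for prompt in country_llm_prompts("stove", "Vietnam",
                                      ["placeholder descriptor A",
                                       "placeholder descriptor B"]):
        print(prompt)
```

In a zero-shot setting, each prompt would be encoded by the CLIP text encoder and the per-class scores aggregated over that class's prompts; the aggregation rule is not stated in this summary.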

Abstract

Existing object recognition models have been shown to lack robustness in diverse geographical scenarios due to domain shifts in design and context. Class representations need to be adapted to more accurately reflect an object concept under these shifts. In the absence of training data from target geographies, we hypothesize that geographically diverse descriptive knowledge of categories can enhance robustness. For this purpose, we explore the feasibility of probing a large language model for geography-based object knowledge, and we examine the effects of integrating knowledge into zero-shot and learnable soft prompting with CLIP. Within this exploration, we propose geography knowledge regularization to ensure that soft prompts trained on a source set of geographies generalize to an unseen target set. Accuracy gains over prompting baselines on DollarStreet while training only on Europe data are up to +2.8/1.2/1.6 on target data from Africa/Asia/Americas, and +4.6 overall on the hardest classes. Competitive performance is shown vs. few-shot target training, and analysis is provided to direct future study of geographical robustness.
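
To make the idea of geography knowledge regularization more concrete, here is a minimal NumPy sketch. It assumes the regularizer is a cosine-distance penalty that pulls each learned class text embedding toward a fixed, knowledge-derived embedding for the same class; the abstract does not specify the actual form of the term, so the penalty, the weight `lam`, the `temperature` value, and all function names are assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def softmax_cross_entropy(logits, labels):
    # Numerically stable mean cross-entropy over a batch.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def regularized_loss(image_feats, labels, learned_text, knowledge_text,
                     lam=1.0, temperature=0.01):
    """Classification loss on source data plus an assumed knowledge penalty.

    image_feats:    (batch, dim) image embeddings from the source geography
    labels:         (batch,) integer class labels
    learned_text:   (classes, dim) class embeddings from learnable soft prompts
    knowledge_text: (classes, dim) fixed embeddings of geo-diverse descriptions
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(learned_text)
    kno = l2_normalize(knowledge_text)

    # CLIP-style classification: scaled cosine similarity as logits.
    logits = img @ txt.T / temperature
    ce = softmax_cross_entropy(logits, labels)

    # Assumed regularizer: mean cosine distance between the learned and
    # knowledge-derived embedding of each class.
    reg = (1.0 - (txt * kno).sum(axis=1)).mean()
    return ce + lam * reg, ce, reg


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))
    labels = np.array([0, 1, 2, 0])
    learned = rng.normal(size=(3, 8))
    knowledge = rng.normal(size=(3, 8))
    total, ce, reg = regularized_loss(images, labels, learned, knowledge)
    print(f"total={total:.3f} ce={ce:.3f} reg={reg:.3f}")
```

The penalty is zero when the learned embeddings coincide with the knowledge embeddings and grows as training on the source geography pushes them apart, which is the intuition behind regularizing toward knowledge that covers unseen target geographies.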

Paper Structure (15 sections, 8 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: Descriptive knowledge can address concept shifts across geographies. Observe the wide range of object designs and contexts in the DollarStreet [NEURIPS2022_5474d9d4] category "tools" around the world. Our work's premise is that textual representations for classes in vision-language models can be enhanced to better suit diverse object representations across geographies. Map made with plotly.
  • Figure 2: Geography knowledge regularization. To ensure robustness in soft prompt learning, we (1) incorporate knowledge internal to CLIP and externally obtained from an LLM. (2) This descriptive knowledge regularizes class representations when training on a specific source geography (e.g. Europe), thus (3) increasing robustness when generalizing to target geographies (e.g. Vietnam).
  • Figure 3: Geography knowledge-regularized soft prompts trained on source data (ours, green line) vs. few-shot soft prompts trained on target data (blue curve). (a) Src=Europe, Tgt=Africa, Asia, Americas; (b) Src=USA, Tgt=Asia. Our 16-shot model trained only on source data (green) outperforms a model with prompts trained on 12 or 4 shots per class of target data (on DollarStreet and GeoNet, respectively), which amounts to 1140 and 2400 images in total.
  • Figure 4: Qualitative analysis. We show examples where geography-specific descriptors improve/hurt vs. general descriptors in zero-shot inference. We highlight the prediction's descriptors, bolding the highest activating one. Encoder=RN50.
  • Figure 5: UMAP [mcinnes2018umap-software] plot for CountryLLM and the category "homes" in DollarStreet. Country-specific descriptors are often close to those of other countries on the same continent, likely due to similar weather, environment, and/or economic conditions.
  • ...and 4 more figures