Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

Oana Ignat, Longju Bai, Joan Nwatu, Rada Mihalcea

TL;DR

Vision-language models underperform on data from underrepresented geographies because their training data is dominated by Western sources, and annotation costs make it hard to collect geo-diverse data. The authors propose a budget-aware approach that first identifies the country-topic pairs whose visual representations differ most from high-resource data, and then uses data from visually similar countries to fill the gaps. They assemble a low-resource dataset from GeoDE and Dollar Street, map its topics to high-resource datasets (ImageNet, LAION), and use three image representations (CLIP, ALIGN, BLIP-2) to quantify visual similarity across 52 countries and 94 topics; 422 pairs are identified as particularly beneficial to annotate. Augmenting training data with images from visually similar countries improves topic classification, while geographical distance is a poor predictor of visual similarity. Together, these results suggest that geo-informed, context-aware annotation strategies can reduce labeling costs and improve model inclusivity.

Abstract

Current foundation models have shown impressive performance across various tasks. However, several studies have revealed that these models are not effective for everyone due to the imbalanced geographical and economic representation of the data used in the training process. Most of this data comes from Western countries, leading to poor results for underrepresented countries. To address this issue, more data needs to be collected from these countries, but the cost of annotation can be a significant bottleneck. In this paper, we propose methods to identify the data to be annotated to balance model performance and annotation costs. Our approach first involves finding the countries with images of topics (objects and actions) most visually distinct from those already in the training datasets used by current large vision-language foundation models. Next, we identify countries with higher visual similarity for these topics and show that using data from these countries to supplement the training data improves model performance and reduces annotation costs. The resulting lists of countries and corresponding topics are made available at https://github.com/MichiganNLP/visual_diversity_budget.
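
The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes image embeddings (for example from CLIP) have already been computed, that each group of images is summarized by its mean embedding, and that cosine similarity is the comparison metric. The function names `mean_embedding`, `rank_pairs_for_annotation`, and `similar_countries`, and the toy 2-D vectors, are hypothetical.

```python
import numpy as np

def mean_embedding(embs):
    """Average a (n_images, dim) array into one L2-normalized vector."""
    v = np.asarray(embs, dtype=float).mean(axis=0)
    return v / np.linalg.norm(v)

def rank_pairs_for_annotation(low, high):
    """Step 1 (assumed metric: cosine similarity of mean embeddings).

    low:  {(topic, country): (n, dim) array} of low-resource image embeddings
    high: {topic: (m, dim) array} of high-resource image embeddings
    Returns (topic, country, similarity) tuples, least similar first,
    i.e. the pairs that would benefit most from annotation come first.
    """
    scored = []
    for (topic, country), embs in low.items():
        if topic not in high:
            continue  # topic could not be mapped to the high-resource data
        sim = float(mean_embedding(embs) @ mean_embedding(high[topic]))
        scored.append((topic, country, sim))
    return sorted(scored, key=lambda t: t[2])

def similar_countries(low, topic, target, k=3):
    """Step 2: rank other countries by similarity to `target` for one topic."""
    target_vec = mean_embedding(low[(topic, target)])
    scored = [
        (country, float(mean_embedding(embs) @ target_vec))
        for (t, country), embs in low.items()
        if t == topic and country != target
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Toy 2-D "embeddings" for illustration only; real embeddings are much larger.
low = {
    ("toothbrush", "A"): np.array([[1.0, 0.0], [1.0, 0.2]]),
    ("toothbrush", "B"): np.array([[0.0, 1.0], [0.1, 1.0]]),
    ("toothbrush", "C"): np.array([[0.9, 0.1], [1.0, 0.0]]),
}
high = {"toothbrush": np.array([[1.0, 0.0], [0.9, 0.1]])}

ranked = rank_pairs_for_annotation(low, high)
neighbors = similar_countries(low, "toothbrush", "A", k=1)
```

In this toy setup, country B is the furthest from the high-resource data and so heads the annotation list, while country C is the closest neighbor of country A and would be the first candidate for supplementing A's data.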

Paper Structure (38 sections, 23 figures, 1 table)

Figures (23)

  • Figure 1: Vision-language models work poorly on data from underrepresented countries. This is primarily due to the diverse appearance of topics (objects and actions) across countries (e.g., "toothbrush"). However, collecting diverse global data is very expensive. To keep annotation within budget, we propose to (1) annotate the images that are visually different from the ones in high-resource datasets such as LAION or ImageNet; (2) supplement data from low-resource countries with data from visually similar countries.
  • Figure 2: Example images ("cooking pot") in low-resource data (left) vs. in high-resource data (right).
  • Figure 3: Similarity heatmap of (topic, country) pairs. Rows and columns are sorted from the least to the most similar, based on the average similarity score. The lighter the color, the lower the similarity between high-resource and low-resource data for the corresponding (topic, country) pair, and the more beneficial it is to annotate. We highlight in black the pairs we determine to benefit the most from annotations. Grey cells have fewer than ten images and are therefore discarded. Best viewed in color.
  • Figure 4: PCA for the topic "toothbrush" for all countries that contain this topic in the low-resource data and in the high-resource data. The high-resource data point is highlighted with a star symbol. The data is represented as the average of the CLIP representations.
  • Figure 5: Top three and last three countries (left) and topics (right) sorted by average similarity score.
  • ...and 18 more figures
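
The layout described in the Figure 3 caption (a topic-by-country grid, cells with fewer than ten images discarded, rows and columns sorted by average similarity) can be sketched as below. This is only an illustrative reconstruction under stated assumptions: the similarity scores and image counts are taken as given inputs, and the function name `build_heatmap` and all toy values are hypothetical rather than taken from the paper.

```python
import numpy as np

MIN_IMAGES = 10  # cells with fewer images are discarded, per the Figure 3 caption

def build_heatmap(sims, counts, topics, countries, min_images=MIN_IMAGES):
    """Build a (topic x country) similarity grid sorted least-to-most similar.

    sims:   {(topic, country): similarity score}
    counts: {(topic, country): number of images}
    Discarded or missing cells are stored as NaN.
    """
    grid = np.full((len(topics), len(countries)), np.nan)
    for i, topic in enumerate(topics):
        for j, country in enumerate(countries):
            key = (topic, country)
            if key in sims and counts.get(key, 0) >= min_images:
                grid[i, j] = sims[key]
    # Sort rows and columns by their average similarity, ignoring NaN cells.
    row_order = np.argsort(np.nanmean(grid, axis=1))
    col_order = np.argsort(np.nanmean(grid, axis=0))
    sorted_grid = grid[np.ix_(row_order, col_order)]
    return (
        sorted_grid,
        [topics[i] for i in row_order],
        [countries[j] for j in col_order],
    )

# Toy values for illustration only.
topics = ["toothbrush", "cooking pot"]
countries = ["A", "B", "C"]
sims = {
    ("toothbrush", "A"): 0.9, ("toothbrush", "B"): 0.5, ("toothbrush", "C"): 0.7,
    ("cooking pot", "A"): 0.6, ("cooking pot", "B"): 0.3, ("cooking pot", "C"): 0.8,
}
counts = {key: 20 for key in sims}
counts[("cooking pot", "C")] = 5  # too few images, so this cell is discarded

grid, sorted_topics, sorted_countries = build_heatmap(sims, counts, topics, countries)
```

Note that a row or column made up entirely of discarded cells has no defined average; `np.nanmean` emits a warning in that case and `np.argsort` places such rows or columns last.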