
Validating and Exploring Large Geographic Corpora

Jonathan Dunn

TL;DR

Standard corpus creation techniques can accidentally exclude under-represented languages and populations from large geographic web corpora.

Abstract

This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427 billion word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English: (i) the agreement of independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each of these steps is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus with baseline data sets. The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations. This result shows how standard corpus creation techniques can accidentally exclude under-represented populations.
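
As an illustration of step (ii), hash-based deduplication can be sketched as follows. This is a minimal sketch rather than the paper's implementation: the abstract does not say which hash function or normalization is used, so the choice of SHA-1 and the lowercase, whitespace-collapsing `normalize` helper are assumptions, and `deduplicate` is a hypothetical name.

```python
import hashlib


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace so
    # that trivially different copies of a sample produce the same hash.
    return " ".join(text.lower().split())


def deduplicate(samples):
    # Keep the first occurrence of each distinct (normalized) sample.
    seen = set()
    unique = []
    for text in samples:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


samples = [
    "Kia ora from Wellington",
    "kia ora  from wellington",   # duplicate after normalization
    "Weather in Auckland today",
]
unique_samples = deduplicate(samples)
```

Storing digests instead of full strings keeps memory use bounded per sample, which matters at the scale of a web corpus. Exact hashing of this kind only catches copies that are identical after normalization; near-duplicates would need a different technique.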

Paper Structure

This paper contains 10 sections, 1 equation, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Sequence of cleaning methods from CGLU 4.3 to CGLU 5.2
  • Figure 2: Map showing agreement between language identification models by country. A value of 0.80 means that 80% of samples receive the same language label from each model.
  • Figure 3: Accuracy of the outlier detection method for finding samples with injected noise. Ratio refers to the amount of noise added and Accuracy to the percent of such samples correctly identified.
  • Figure 4: Similarity of the Swiss German corpus to the benchmark language identification corpus over each stage of cleaning. Higher values indicate more similar corpora. Significance of differences is tested using an ANOVA, here with a value of $p<0.001$.
  • Figure 5: Similarity of the Chilean Spanish corpus to the benchmark corpus of tweets in Spanish from Chile over each stage of cleaning. Higher values indicate more similar corpora. Significance of differences is tested using an ANOVA, here with a value of $p<0.001$.
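
The agreement value described in the Figure 2 caption can be computed per country as the share of samples to which every model assigns the same label. The sketch below assumes each record is a country code paired with a tuple of predicted labels, one per model. The function name `agreement_by_country`, the record layout, and the toy labels are illustrative assumptions, not data or code from the paper.

```python
from collections import defaultdict


def agreement_by_country(records):
    # records: iterable of (country, labels), where labels holds one
    # predicted language label per identification model.
    totals = defaultdict(int)
    agreed = defaultdict(int)
    for country, labels in records:
        totals[country] += 1
        # A sample counts as agreed only if all models give the same label.
        if len(set(labels)) == 1:
            agreed[country] += 1
    return {country: agreed[country] / totals[country] for country in totals}


# Toy records (made up for illustration only).
records = [
    ("NZ", ("eng", "eng")),
    ("NZ", ("eng", "eng")),
    ("NZ", ("eng", "mri")),
    ("NZ", ("mri", "mri")),
    ("NZ", ("eng", "eng")),
    ("CH", ("gsw", "deu")),
    ("CH", ("gsw", "gsw")),
]
rates = agreement_by_country(records)
```

Here four of the five NZ samples receive matching labels, giving a rate of 0.8, which corresponds to the caption's reading of 0.80 as 80% agreement.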