Detecting Linguistic Diversity on Social Media
Sidney Wong, Benjamin Adams, Jonathan Dunn
TL;DR
This work investigates whether social media can supplement traditional census data to map linguistic diversity in Aotearoa New Zealand. By pairing census ground truth with the Corpus of Global Language Use (CGLU) social-media sub-corpus and applying two language-identification models (idNet and pacificLID), the authors derive $CR_{10}$ as a measure of linguistic diversity across national, regional, and local geographies, and examine temporal dynamics via monthly tweet frequencies. They find general alignment between census and social-media signals at the national level but notable regional and urban-rural differences, including a Wellington case study showing real-time shifts in language use during early COVID-19. The study demonstrates that social media data can provide rich, contemporaneous linguistic insights and detect demographic/sociopolitical effects on language, while highlighting limitations related to sample representativeness, platform biases, and the need for careful interpretation and ethical considerations. Overall, social-media data can augment official statistics to deliver finer-grained, timely views of linguistic diversity, informing language policy and revitalisation efforts when used alongside census data.
Abstract
This chapter explores the efficacy of using social media data to examine the changing linguistic behaviour of a place. We focus our investigation on Aotearoa New Zealand, where official statistics from the census are the only source of language use data. We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source. We use place as the common denominator between the two data sources. We identify the language conditions of each tweet in the social media data set and validate our results with two language identification models. We then compare levels of linguistic diversity at national, regional, and local geographies. The results suggest that social media language data has the potential to provide a rich source of spatial and temporal insights on the linguistic profile of a place. We show that social media is sensitive to demographic and sociopolitical changes within a language and at low-level regional and local geographies.
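As a rough illustration of how linguistic diversity might be compared across places, the sketch below computes a concentration ratio per place from language-labelled posts. It assumes that $CR_{10}$ denotes the standard concentration ratio, that is, the combined share of the ten most frequent languages; the text above does not define it, so this reading is an assumption. The function names, the `(place, language)` record layout, and the toy data are hypothetical and are not taken from the chapter.

```python
from collections import Counter, defaultdict


def concentration_ratio(labels, n=10):
    """Combined share (0 to 1) of the n most frequent labels.

    Assumption: CR_n is the standard concentration ratio. Under this
    reading, a lower value means usage is spread over more languages.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute a concentration ratio with no labels")
    top_n = sum(count for _, count in counts.most_common(n))
    return top_n / total


def concentration_by_place(records, n=10):
    """Group hypothetical (place, language) records and score each place."""
    labels_by_place = defaultdict(list)
    for place, language in records:
        labels_by_place[place].append(language)
    return {
        place: concentration_ratio(labels, n)
        for place, labels in labels_by_place.items()
    }


# Toy data only: language codes as a language-identification model might emit them.
toy_records = (
    [("Wellington", "en")] * 6
    + [("Wellington", "mi")] * 3
    + [("Wellington", "zh")] * 1
    + [("Auckland", "en")] * 4
    + [("Auckland", "sm")] * 4
)

# n=2 keeps the toy example informative; the text above refers to n=10.
scores = concentration_by_place(toy_records, n=2)
```

With so few distinct languages in the toy data, `n=10` would return 1.0 for every place, which is why the example uses a smaller `n`. The same grouping could be applied at national, regional, or local levels by changing what the `place` key represents.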
