Socially Responsible Data for Large Multilingual Language Models

Andrew Smart, Ben Hutchinson, Lameck Mbangula Amugongo, Suzanne Dikker, Alex Zito, Amber Ebinama, Zara Wudiri, Ding Wang, Erin van Liemt, João Sedoc, Seyi Olojo, Stanley Uwakwe, Edem Wornyo, Sonja Schmer-Galunder, Jamila Smith-Loud

TL;DR

This position paper argues that expanding multilingual LLMs must go beyond merely increasing non-English data and must address historical and ongoing power imbalances in data collection. It critiques Western-centric notions of language, proposes decolonizing methodologies (including Ubuntu-based ethics, community-based research, and data sovereignty), and offers twelve concrete recommendations to ensure linguistic rights, consent, and local governance. The authors highlight frameworks like language sovereignty networks and indigenous data governance to prevent extractive practices and promote equitable benefits for language communities. The work emphasizes epistemic humility and co-creation with communities to preserve cultural integrity while expanding access to multilingual AI tools. Overall, it provides an actionable roadmap for ethical, inclusive, and culturally informed data practices in large multilingual language models.

Abstract

Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in multilingual LLMs, and various efforts are striving for models to accommodate languages of communities outside of the Global North, which include many languages that have been historically underrepresented in digital realms. These languages have been termed "low resource languages" or "long-tail languages", and LLM performance on them is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data on languages spoken by previously colonized people and indigenous people, and on non-Western languages more broadly, raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship and our own work to outline several relevant social, cultural, and ethical considerations and potential ways to address them through qualitative research, community partnerships, and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.
Paper Structure (19 sections, 1 table)