Guylingo: The Republic of Guyana Creole Corpora

Christopher Clarke, Roland Daynauth, Charlene Wilkinson, Hubert Devonish, Jason Mars

TL;DR

Guylingo addresses the scarcity of NLP resources for Caribbean Creoles by building a dedicated Creolese corpus for Guyana. The authors collect data through expert collaborations and from online sources, encode it in the Cave-GLU phonemic system, and assemble 2373 Creole sentences containing 4177 unique words to support English–Guyanese Creole translation. They develop the Guyanese Creole Translation Tool, curate a translation dataset of 1969 pairs (with 302 test pairs), and evaluate several MT models, including T5, BART, Pegasus, and GPT-4, with mixed results across translation directions. The work also demonstrates AI-driven applications, such as IRIS and WhatsApp agents, that promote Creole usage, and it discusses the opportunities and limitations for formal adoption and for ongoing dataset updates.

Abstract

While major languages often enjoy substantial attention and resources, the linguistic diversity across the globe encompasses a multitude of smaller, indigenous, and regional languages that lack the same level of computational support. One such region is the Caribbean. While commonly labeled as "English speaking", the ex-British Caribbean region consists of a myriad of Creole languages thriving alongside English. In this paper, we present Guylingo: a comprehensive corpus designed for advancing NLP research in the domain of Creolese (Guyanese English-lexicon Creole), the most widely spoken language in the culturally rich nation of Guyana. We first outline our framework for gathering and digitizing this diverse corpus, inclusive of colloquial expressions, idioms, and regional variations in a low-resource language. We then demonstrate the challenges of training and evaluating NLP models for machine translation in Creole. Lastly, we discuss the unique opportunities presented by recent NLP advancements for accelerating the formal adoption of Creole languages as official languages in the Caribbean.

Paper Structure

This paper contains 19 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Map of Guyana and its neighboring territories
  • Figure 2: User interface of the Guyanese Creole Translation Tool. This tool allows experts to rapidly and iteratively create translation pairs using GPT-4 [openai2023gpt4] as a generator.
  • Figure 3: Conversational agent in WhatsApp speaking in Guyanese Creole.
  • Figure 4: Example GPT-4 prompt with translation examples from [peirs1902proverbs].