Molyé: A Corpus-based Approach to Language Contact in Colonial France

Rasul Dent, Juliette Janès, Thibault Clérice, Pedro Ortiz Suarez, Benoît Sagot

TL;DR

The paper presents Molyé, an open corpus linking French literary stereotypes and early French-based Creoles across four centuries to investigate language contact in colonial France. It comprises 68 works curated from a larger set, with TEI-encoded annotations and a structured labeling scheme that enables multi-label analysis of Creole-like features in European texts. Methodologically, it describes document discovery via disjunctive n-grams, TEI encoding, and rule-based linguistic tagging, yielding a 188,866-token resource suitable for diachronic sociolinguistic study. The work demonstrates identifiable European contact patterns, such as Baragouin morphosyntax and Creole-like pronoun use, and provides a platform for testing hypotheses about Creole origins and the European roots of contact phenomena. Overall, Molyé offers a reproducible, open dataset to advance historical sociolinguistics and Creole studies by bridging European literary representations and early Creole attestations.

Abstract

Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.

Paper Structure (16 sections, 1 figure, 1 table)