Table of Contents
Fetching ...

LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

Nina Hosseini-Kivanani, Fred Philippy

Abstract

We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on->oun and eur->er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.

LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

Abstract

We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on->oun and eur->er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.
Paper Structure (33 sections, 4 figures, 3 tables)

This paper contains 33 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Temporal distribution of RTL news articles (1999--2025) with selected linguistic and technological milestones. The brown dotted line marks the 2020 orthography reform and onset of the COVID 19 period.
  • Figure 2: Pipeline for the initial automatic loanword detection. After completing this stage, the resulting loanword corpus was manually reviewed to identify major shortcomings, such as incorrectly tagged words, missing loanwords, or incomplete compound forms, and adjusted accordingly.
  • Figure 3: Distribution of observed borrowing pattern types across the RTL news corpus. Borrowing adaptation types across.
  • Figure 4: Diachronic evolution of code-switching in Luxembourgish news media (1999--2025). Left: monthly CS rate with five-year period means (dashed lines). Right: mean CS rate per period; error bars denote one standard deviation.