Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní

Nemika Tyagi, Nelvin Licona Guevara, Olga Kellert

TL;DR

This work addresses the lack of topic- and sociolinguistically annotated bilingual resources by proposing an LLM-assisted pipeline that labels topic, genre, and discourse-pragmatic functions in Spanish-English and Spanish-Guaraní data. It combines GPT-based topic modeling with sociolinguistic metadata to enable cross-linguistic analysis of high- and low-resource bilingual settings, using the Miami Bilingual Corpus and a newly annotated Spanish-Guaraní dataset. The approach yields high labeling reliability and provides corpus-scale evidence of distinct sociolinguistic patterns, including gender and dominance effects in Miami and a diglossic division between Guaraní and Spanish in Paraguay. The results demonstrate scalable, interpretable resource enrichment for cross-linguistic bilingual research and offer methodological directions for integrating topic, discourse, and sociolinguistic analyses in multilingual NLP.

Abstract

This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.

Paper Structure

This paper contains 31 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Bilingual asymmetries in the Miami corpus, showing variation in topic and function distributions between Spanish- and English-dominant contexts.
  • Figure 2: Language-dominance comparisons for topics and genres in the Spanish-Guaraní dataset. Each row displays two bars (Guaraní-dominant and Spanish-dominant counts); category lists are trimmed to the most frequent items for clarity.