The Sociolinguistic Foundations of Language Modeling

Jack Grieve, Sara Bartl, Matteo Fuoli, Jason Grafmiller, Weihang Huang, Alejandro Jawerbaum, Akira Murakami, Marcus Perlman, Dana Roemling, Bodo Winter

TL;DR

This paper reframes large language models as models of varieties of language, defined by external factors such as dialect, register, and period. By grounding corpus design and evaluation in sociolinguistic theory, it addresses five core challenges—social bias, domain adaptation, alignment, language change, and scale—through the lens of representing target varieties and their internal structure. The authors argue that carefully curating stratified, diverse corpora that capture the full varietal architecture of the target language can improve performance, reduce harms, and better align models with societal values. They also discuss continual updates to reflect language change and the emergence of machine-influenced varieties, emphasizing the practical impact of sociolinguistic insight on safe, effective, and equitable AI systems.

Abstract

In this paper, we introduce a sociolinguistic perspective on language modeling. We claim that large language models are inherently models of varieties of language, and we consider how this insight can inform the development and deployment of large language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective can help address five basic challenges in language modeling: social bias, domain adaptation, alignment, language change, and scale. Ultimately, we argue that it is crucial to carefully define and compile training corpora that accurately represent the specific varieties of language being modeled to maximize the performance and societal value of large language models.

Paper Structure (9 sections, 4 figures)

Figures (4)

  • Figure 1: Varieties of Language. This figure defines the concept of a variety of language, illustrating how the interaction between three distinct extra-linguistic factors – the social background of people who produce language (dialect), the social context in which language is produced (register), and the range of time over which language is produced (period) – can be used to specify a variety of language. It also illustrates how varieties of language are hierarchically organized, composed of smaller and smaller sub-varieties.
  • Figure 2: Representative Corpus Design. This figure presents a corpus as a representative sample of texts taken from a given variety of language (i.e., from a larger population of texts delimited by relevant extra-linguistic factors). This figure also illustrates how compiling a corpus that accurately represents a target variety requires access to an underlying model of that variety of language, including its internal sub-varieties, so that the corpus can be stratified so as to capture internal variation in that variety. Naïve corpus compilation strategies that rely on convenience sampling will generally lead to less representative samples.
  • Figure 3: Sociolinguistic Bias in Language Models. This figure illustrates how training language models on corpora that accurately represent the target variety of language, including its internal structure and especially its constituent dialects, can help address social bias, including both quality-of-service harms and stereotyping. This is exemplified by comparing two hypothetical models of American English: one trained on a corpus that misrepresents regional dialect variation (based on Grieve, 2016) in this larger variety of language, and one trained on a corpus that represents it accurately.
  • Figure 4: Sociolinguistic Adaptation of Language Models. This figure illustrates how an understanding of the sociolinguistic structure of varieties of language can inform the adaptation of language models. Language model adaptation can be seen as the process of fine-tuning a base model, potentially in an iterative manner, to predict word tokens in a more narrowly defined variety of language that is subsumed by the larger variety of language represented by the base model.
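
The contrast in Figure 2 between stratified and convenience sampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the function name `stratified_sample`, the record keys `dialect` and `register`, the placeholder labels `dialect_a` and `dialect_b`, and the target shares are all hypothetical. The sketch assumes an underlying model of the variety is available in the form of a target share for each sub-variety.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(texts, target_shares, n, seed=0):
    """Draw roughly n texts so that each stratum matches its target share.

    texts: list of dicts with the (hypothetical) keys 'dialect' and 'register'.
    target_shares: dict mapping (dialect, register) -> proportion of the sample.
    """
    rng = random.Random(seed)

    # Group the available texts by stratum (sub-variety).
    by_stratum = defaultdict(list)
    for t in texts:
        by_stratum[(t["dialect"], t["register"])].append(t)

    sample = []
    for stratum, share in target_shares.items():
        pool = by_stratum.get(stratum, [])
        quota = round(share * n)
        # A stratum cannot contribute more texts than it contains.
        k = min(quota, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample


# Toy pool that over-represents one dialect, as a convenience sample might.
pool = (
    [{"dialect": "dialect_a", "register": "news", "text": f"a{i}"} for i in range(90)]
    + [{"dialect": "dialect_b", "register": "news", "text": f"b{i}"} for i in range(10)]
)

# Assumed target: both dialects equally represented in the variety.
targets = {("dialect_a", "news"): 0.5, ("dialect_b", "news"): 0.5}

sample = stratified_sample(pool, targets, n=20)
counts = Counter(t["dialect"] for t in sample)
```

In this toy case the pool is 90% `dialect_a`, but the sample is balanced because the quotas come from the assumed target shares rather than from what happens to be available. When a stratum holds fewer texts than its quota, the sketch simply takes all of them, so the sample can fall short of `n`.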