Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus

Craig Messner; Tom Lippincott

TL;DR

We find evidence that the choice of tokenization scheme meaningfully impacts the type of orthographic information a model is able to surface, and that the “dialect effect” produced by intentional orthographic variation employs multiple linguistic channels.

Abstract

We present a dataset of 19th-century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags, designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token-level (BERT) and character-level (CANINE) contextual language models. We find indications that the "dialect effect" produced by intentional orthographic variation employs multiple linguistic channels, and that these channels can be surfaced to varying degrees depending on particular language modeling assumptions. Specifically, we find evidence that the choice of tokenization scheme meaningfully impacts the type of orthographic information a model is able to surface.
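To make the token-level versus character-level contrast concrete, the following is a minimal sketch rather than the authors' pipeline. It assumes a hypothetical six-entry vocabulary (`TOY_VOCAB`) and an illustrative variant spelling ("goin'" for "going") that is not taken from the dataset. The function `wordpiece_tokenize` imitates the greedy longest-match-first segmentation used by WordPiece tokenizers such as BERT's, while omitting the punctuation pre-splitting that a real BERT tokenizer performs first. The function `codepoint_tokenize` imitates CANINE-style input, which operates directly on Unicode code points.

```python
# Hypothetical miniature vocabulary for illustration only; a real WordPiece
# vocabulary (e.g. BERT's) contains tens of thousands of entries.
TOY_VOCAB = {"going", "go", "##in", "##ing", "##'", "[UNK]"}


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation in the style of WordPiece.

    Simplification: no pre-tokenization step, so punctuation stays attached.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def codepoint_tokenize(word):
    """Character-level view: one integer Unicode code point per character."""
    return [ord(ch) for ch in word]


standard, variant = "going", "goin'"
print(wordpiece_tokenize(standard, TOY_VOCAB))  # one whole-word piece
print(wordpiece_tokenize(variant, TOY_VOCAB))   # several subword pieces
print(codepoint_tokenize(standard))
print(codepoint_tokenize(variant))
```

Under these assumptions, the standard spelling maps to a single vocabulary item while the variant is split into several pieces, so the two forms share no subword units. The code-point sequences, by contrast, are identical in their first four positions. This is one way in which the two schemes could expose different orthographic information to a model.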

Paper Structure (11 sections, 3 figures, 3 tables)

This paper contains 11 sections, 3 figures, 3 tables.

Introduction
Experiments
Setup
Procedure
Evaluation
Results and Discussion
Evaluating absolute
Evaluating relative
Evaluation in the light of Dtag and semantic information
Conclusions and Further Work
Limitations

Figures (3)

Figure 1: Full absolute set (T), SO absolute set (B) accuracy by $k$.
Figure 2: Purity across the full relative set (T) and across non order-swapped tokens (B).
Figure 3: Dtag purity over the obv token embedding (T), Dtag purity over the std-obv relative set (B).
