
Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models

Craig Messner, Tom Lippincott

TL;DR

This work investigates pairing literary orthographic variants with their standard spellings using neural edit-distance models, emphasizing that 19th-century literary orthography encodes stylistic meaning and is therefore distinct from learner error. It introduces a novel corpus of literary variants and compares neural edit-distance models trained on this data with models trained on learner-error orthography, examining how negative-sample generation strategies affect performance. Across two experiments (candidate filtering and pair prediction), the study finds corpus-specific differences: random negatives help on the learner-based data, while mixed negatives help on the literature-based data, which suggests that the two corpora occupy distinct transformation spaces. The findings clarify the domain-specific challenges of string pairing and point toward improvements in historical text normalization and cognate-like tasks in literary corpora.

Abstract

We present a novel corpus of orthographically variant words found in works of 19th-century U.S. literature, each annotated with its corresponding "standard" form. We train a set of neural edit distance models to pair these variants with their standard forms, and compare their performance with that of a set of neural edit distance models trained on a corpus of orthographic errors made by L2 English learners. Finally, we analyze the relative performance of these models in light of different negative training sample generation strategies, and offer concluding remarks on the unique challenge that literary orthographic variation poses to string pairing methodologies.

Paper Structure (13 sections, 2 figures, 3 tables)