"Dialogue" vs "Dialog" in NLP and AI research: Statistics from a Confused Discourse
David Gros
TL;DR
This study investigates the spelling variation between "dialogue" and "dialog" in NLP/AI research using a large corpus of papers. It defines Dialog(ue) Papers and High Impact Dialog(ue) Venues, assembles data from the Semantic Scholar corpus up to March 2024, and analyzes the overall distribution, time trends, author-level patterns, nationality effects, and contextual influences. The contextual analysis combines several methods: noun-phrase analysis, RoBERTa embeddings, morphology, and usage in source code. Among the papers studied, "dialogue" dominates (72%), "dialog" is substantial (24%), and some papers mix both (5%). There is no clear long-term shift, and only weak evidence that context or nationality predicts spelling; usage in code and in compounds hints that economy drives some choices. The results offer a descriptive framework for orthography in scientific discourse and point to body-text analyses and cross-field comparisons as the next steps for understanding spelling variance. The findings are of practical use to researchers, editors, and tool builders who need to navigate spelling conventions in computing literature.
Abstract
Within computing research, there are two spellings for an increasingly important term: dialogue and dialog. We analyze thousands of research papers to understand this "dialog(ue) debacle". Among publications in top venues that use "dialog(ue)" in the title or abstract, 72% use "dialogue", 24% use "dialog", and 5% use both in the same title and abstract. This split distribution is more common in Computing than in any other academic discipline. We investigate trends over ~20 years of NLP/AI research and find no clear evidence of a shift over time. Author nationality is weakly correlated with spelling choice, but it is far from explaining the mixed use. Many prolific authors publish papers with both spellings. We use several methods (such as syntactic parses and LM embeddings) to study how the context of dialog(ue) influences spelling, and find that the influence is limited. Combining these results, we discuss different theories that might explain the dialog(ue) divergence.
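The three-way split reported above (only "dialogue", only "dialog", or both) can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the function names `classify_spelling` and `spelling_shares` are hypothetical, and the sketch assumes each paper is represented by its title and abstract joined into one string, with only the bare forms and their plurals counted.

```python
import re
from collections import Counter

# Assumption: only "dialog(s)" and "dialogue(s)" are counted; other
# morphological variants (e.g. "dialogic") are ignored in this sketch.
DIALOGUE_RE = re.compile(r"\bdialogues?\b", re.IGNORECASE)
DIALOG_RE = re.compile(r"\bdialogs?\b", re.IGNORECASE)


def classify_spelling(text):
    """Label a title+abstract string by which spelling(s) it contains."""
    has_long = bool(DIALOGUE_RE.search(text))
    has_short = bool(DIALOG_RE.search(text))
    if has_long and has_short:
        return "both"
    if has_long:
        return "dialogue"
    if has_short:
        return "dialog"
    return "neither"


def spelling_shares(texts):
    """Share of each label among texts that use the term at all."""
    counts = Counter(classify_spelling(t) for t in texts)
    counts.pop("neither", None)  # papers without the term are excluded
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

The word boundary after `dialogs?` keeps the short pattern from matching inside "dialogue", so a text is labeled "both" only when the two spellings occur as separate words.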
