From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media

Maria Ryskina, Matthew R. Gormley, Kyle Mahowald, David R. Mortensen, Taylor Berg-Kirkpatrick, Vivek Kulkarni

Abstract

Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence (neology) identified two factors correlated with creation of new words by analyzing a corpus consisting primarily of historical published texts (Ryskina et al., 2020, arXiv:2001.07740). Extending this methodology to contextual embeddings in addition to static ones and applying it to a new corpus of Twitter posts, we show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different neologism formation mechanisms.

Paper Structure (37 sections, 5 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Experimental comparison between the neighbourhoods of neologisms (blue bars) and control words (red bars) in the published writing domain. The three plots in each row correspond to three measures: the number of historical neighbours a word has (left), how monotonically these neighbours grow in frequency (centre), and the linear regression slope of their growth (right). The x-axis on all plots corresponds to the neighbourhood size (defined by the cosine similarity threshold $\tau$). The top and bottom rows show the results with the static Word2Vec embeddings and the contextual RoBERTa embeddings respectively. Error bars represent standard error over words. The number of asterisks above a pair of bars indicates the statistical significance of their difference per Wilcoxon signed-rank test: *** for $p < 0.001$, ** for $0.001 \leq p < 0.01$, * for $0.01 \leq p < 0.05$, none for $p \geq 0.05$.
  • Figure 2: Experimental comparison between the neighbourhoods of neologisms (blue bars) and control words (red bars) in the Twitter domain. The three plots in each row correspond to three measures: the number of historical neighbours a word has (left), how monotonically these neighbours grow in frequency (centre), and the linear regression slope of their growth (right). The x-axis on all plots corresponds to the neighbourhood size (defined by the cosine similarity threshold $\tau$). The top and bottom rows show the results with the static Word2Vec embeddings and the contextual RoBERTa embeddings respectively. Error bars represent standard error over words. The number of asterisks above a pair of bars indicates the statistical significance of their difference per Wilcoxon signed-rank test: *** for $p < 0.001$, ** for $0.001 \leq p < 0.01$, * for $0.01 \leq p < 0.05$, none for $p \geq 0.05$.
  • Figure 3: Experimental comparison between the neighbourhoods of neologisms (blue bars) and control words (red bars) in the published writing domain. Results are reported for 755 neologism--control pairs created from the original, non-filtered neologism list of 1000 candidate neologisms. The three plots in each row correspond to three measures: the number of historical neighbours a word has (left), how monotonically these neighbours grow in frequency (centre), and the linear regression slope of their growth (right). The x-axis on all plots corresponds to the neighbourhood size (defined by the cosine similarity threshold $\tau$). The top and bottom rows show the results with the static Word2Vec embeddings and the contextual RoBERTa embeddings respectively. Error bars represent standard error over words.
  • Figure 4: Experimental comparison between the neighbourhoods of neologisms (blue bars) and control words (red bars) in the Twitter domain. Results are reported for 451 neologism--control pairs created from the original, non-filtered neologism list of 938 candidate neologisms. The three plots in each row correspond to three measures: the number of historical neighbours a word has (left), how monotonically these neighbours grow in frequency (centre), and the linear regression slope of their growth (right). The x-axis on all plots corresponds to the neighbourhood size (defined by the cosine similarity threshold $\tau$). The top and bottom rows show the results with the static Word2Vec embeddings and the contextual RoBERTa embeddings respectively. Error bars represent standard error over words.
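
The three neighbourhood measures named in the captions above can be illustrated with a minimal sketch. This is not the authors' implementation. All function names and the toy vectors are hypothetical, and using Spearman's rank correlation between time step and frequency as the monotonicity score is an assumption, since the captions only say "how monotonically these neighbours grow in frequency". How the per-neighbour scores are aggregated over a neighbourhood is also not stated in the captions and is left out here. The last helper maps a p-value to the asterisk notation defined in the captions of Figures 1 and 2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbours(word_vec, vocab_vecs, tau):
    """Words whose cosine similarity to word_vec is at least tau."""
    return [w for w, v in vocab_vecs.items() if cosine(word_vec, v) >= tau]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def monotonicity(freqs):
    """ASSUMPTION: Spearman correlation between time step and frequency."""
    steps = list(range(len(freqs)))
    return pearson(ranks(steps), ranks(freqs))


def slope(freqs):
    """Least-squares slope of frequency regressed on the time step index."""
    steps = list(range(len(freqs)))
    mt, mf = sum(steps) / len(steps), sum(freqs) / len(freqs)
    num = sum((t - mt) * (f - mf) for t, f in zip(steps, freqs))
    den = sum((t - mt) ** 2 for t in steps)
    return num / den


def stars(p):
    """Asterisk notation from the captions of Figures 1 and 2."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Toy 2-dimensional "embeddings" (made up for illustration only).
vocab = {"sunscreen": [1.0, 0.1], "lotion": [0.9, 0.3], "tax": [0.0, 1.0]}
query = [1.0, 0.0]
loose = neighbours(query, vocab, tau=0.9)
tight = neighbours(query, vocab, tau=0.96)
```

Raising the threshold $\tau$ shrinks the neighbourhood, which is why the figures report each measure across several values of $\tau$. In this toy setup the looser threshold admits both `sunscreen` and `lotion`, while the tighter one keeps only `sunscreen`. The significance test itself (Wilcoxon signed-rank over neologism and control pairs) is not reimplemented here; `stars` only formats a p-value obtained elsewhere.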