Word reuse and combination support efficient communication of emerging concepts

Aotao Xu, Charles Kemp, Lea Frermann, Yang Xu

TL;DR

This work offers a unified account in which word reuse and combination are constrained by a fundamental tradeoff between two competing communicative needs: minimizing word length and maximizing informativeness.

Abstract

A key function of the lexicon is to express novel concepts as they emerge over time through a process known as lexicalization. The most common lexicalization strategies are the reuse and combination of existing words, but they have typically been studied separately in the areas of word meaning extension and word formation. Here we offer an information-theoretic account of how both strategies are constrained by a fundamental tradeoff between competing communicative pressures: word reuse tends to preserve the average length of word forms at the cost of less precision, while word combination tends to produce more informative words at the expense of greater word length. We test our proposal against a large dataset of reuse items and compounds that appeared in English, French and Finnish over the past century. We find that these historically emerging items achieve higher levels of communicative efficiency than hypothetical ways of constructing the lexicon, and both literal reuse items and compounds tend to be more efficient than their non-literal counterparts. These results suggest that reuse and combination are both consistent with a unified account of lexicalization grounded in the theory of efficient communication.

Paper Structure

This paper contains 20 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of our theoretical proposal. Panel (A) illustrates the lexicalization of emerging concepts using examples from English during the historical interval 1980-2000. The existing lexicon $\mathcal{L}$ and the set of emerging concepts $\mathcal{C}^*$ at time $t_1$ are illustrated on the left. At a later time $t_2$, the attested encoding of the novel concepts $E^*$ enters the expanded lexicon $\mathcal{L}'$, which is shown on the right. Panel (B) illustrates the two opposing pressures in a communicative interaction taking place before $t_2$. Here the speaker intends to convey the emerging concept "cellphone" to a listener whose lexicon does not yet have a word for expressing it, and grey bars illustrate probability distributions over a universe of concepts $\mathcal{C}$ that capture uncertainty regarding the intended concept. Our proposal focuses on two pressures: minimizing the length of the utterance, and minimizing information loss, that is, the difference between the speaker and listener distributions over concepts. Panel (C) illustrates possible encodings of the novel concepts in Panel (A). Each point corresponds to the average length and information loss of an encoding of the novel concepts, and the shaded area corresponds to costs that are not attainable. We propose that word reuse and combination reflect a tradeoff between these two costs, and that both attested reuse items and attested compounds achieve tradeoffs that are relatively efficient. Here the example encodings are simplified to contrast reuse and combination; in reality an encoding can combine both strategies.
  • Figure 2: Illustration comparing (A) attested reuse items and (B) attested compounds to the constructed baselines and the Pareto frontier. Every point corresponds to an encoding of emerging concepts for a specific language and interval. Attested cases are marked in blue, near-synonym baselines in light blue, and random baselines in grey. Black solid lines in the bottom left show the estimated Pareto frontier, and the shaded areas show costs that are not attainable.
  • Figure 3: Efficiency loss of attested encodings for (A) reuse items and (B) compounds relative to the average loss of baselines. Attested loss is marked in blue, and the average loss of near-synonym and random baselines is marked in light blue and grey, respectively. Error bars show bootstrapped 95% confidence intervals.
  • Figure 4: Efficiency loss of individual attested items and of randomly sampled labels for (A) reuse and (B) compounding. The distributions for attested items and random labels are marked in blue and grey, respectively. Examples in Table \ref{table:data_examples} are annotated.
  • Figure 5: Item-level illustration for (A) attested reuse items and (B) attested compounds. Headers correspond to the examples in Table \ref{table:data_examples}, with additional marking for literal items (lit.). Each dark blue dot corresponds to an attested form. Black dots correspond to the item-level Pareto frontier, and light blue dots correspond to the near-synonym set generated for this item; the size of markers for attested items is larger than the size of other markers for improved visibility. A sample of optimal labels and compound head words are shown as text. Note that the axes are swapped relative to Figure \ref{fig:average_summary} and the x-axis is truncated so there is more space to display optimal labels.
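
The figure captions describe each encoding as a point with two costs (length and information loss) and compare attested encodings against a Pareto frontier. The sketch below illustrates that idea in Python under stated assumptions: information loss is taken to be the KL divergence between speaker and listener distributions, which is one common way to measure "the difference" between them but is not confirmed by the text above, and all probabilities, lengths, and function names (`kl_divergence`, `pareto_frontier`) are hypothetical illustrations rather than the authors' implementation or data.

```python
import math


def kl_divergence(speaker, listener):
    """KL(speaker || listener) in bits over a shared, ordered list of concepts.

    Assumption: the listener assigns nonzero probability wherever the
    speaker does; otherwise the divergence would be infinite.
    """
    return sum(s * math.log2(s / l) for s, l in zip(speaker, listener) if s > 0)


def pareto_frontier(points):
    """Return the (length, loss) points that no other point dominates.

    Lower is better on both axes. After sorting by length (then loss),
    a point is kept only if its loss is strictly below every loss seen so far.
    """
    frontier = []
    best_loss = math.inf
    for length, loss in sorted(points):
        if loss < best_loss:
            frontier.append((length, loss))
            best_loss = loss
    return frontier


# Hypothetical toy example: the speaker is certain about the intended concept.
speaker = [1.0, 0.0, 0.0]

# Hypothetical listener distributions after hearing each candidate form.
listener_reuse = [0.25, 0.5, 0.25]     # short reused word, less precise
listener_compound = [0.5, 0.25, 0.25]  # longer compound, more precise

reuse_point = (len("phone"), kl_divergence(speaker, listener_reuse))
compound_point = (len("cellphone"), kl_divergence(speaker, listener_compound))
inefficient_point = (9, 2.5)  # hypothetical form: long and imprecise

frontier = pareto_frontier([reuse_point, compound_point, inefficient_point])
print(frontier)
```

In this toy setting the reused word and the compound both sit on the frontier because each is better on one cost, while the third form is dominated: it is as long as the compound but loses more information.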