Lewis's Signaling Game as beta-VAE For Natural Word Lengths and Segments

Ryo Ueda, Tadahiro Taniguchi

TL;DR

This work addresses the gap between emergent languages in emergent communication and natural-language statistics by reframing Lewis's signaling game as a $β$-VAE and optimizing an ELBO with a learnable prior. The approach introduces a prior over messages as a language model, enabling variable-length messages and a principled tradeoff between informativeness and processing cost via surprisal theory. Empirical results show improvements in segmentation-relevant properties (Zipf's law of abbreviation and Harris's articulation scheme) and related metrics, suggesting prior choice can guide emergent languages toward more natural structures. Overall, the paper offers a principled, generative framework for EC that connects representation learning, cognitive theories, and language statistics, with broad implications for designing more interpretable and human-like emergent languages.

Abstract

As a sub-discipline of evolutionary and computational linguistics, emergent communication (EC) studies communication protocols, called emergent languages, arising in simulations where agents communicate. A key goal of EC is to give rise to languages that share statistical properties with natural languages. In this paper, we reinterpret Lewis's signaling game, a frequently used setting in EC, as beta-VAE and reformulate its objective function as ELBO. Consequently, we clarify the existence of prior distributions of emergent languages and show that the choice of the priors can influence their statistical properties. Specifically, we address the properties of word lengths and segmentation, known as Zipf's law of abbreviation (ZLA) and Harris's articulation scheme (HAS), respectively. It has been reported that the emergent languages do not follow them when using the conventional objective. We experimentally demonstrate that by selecting an appropriate prior distribution, more natural segments emerge, while suggesting that the conventional one prevents the languages from following ZLA and HAS.
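
The tradeoff described above can be illustrated with a toy calculation. The following is a minimal sketch, assuming the generic beta-VAE form (expected reconstruction log-likelihood minus beta times the KL divergence from the sender to the prior) and a one-symbol message space. The function name `beta_elbo` and all argument names are hypothetical and are not taken from the paper, whose actual objective handles variable-length messages and a learnable prior.

```python
import math


def beta_elbo(p_x, sender, receiver, prior, beta=1.0):
    """Toy beta-VAE-style objective for a one-symbol signaling game.

    All names here are illustrative assumptions, not the paper's notation:
      p_x[x]         : probability of object x
      sender[x][m]   : S(m | x), playing the role of the encoder
      receiver[m][x] : R(x | m), playing the role of the decoder
      prior[m]       : P(m), the prior over messages
    Probabilities paired with a nonzero sender probability must be > 0,
    otherwise math.log raises ValueError.
    """
    total = 0.0
    for x, px in enumerate(p_x):
        recon = 0.0  # E_{m ~ S(.|x)}[log R(x | m)]
        kl = 0.0     # KL(S(.|x) || P)
        for m, s in enumerate(sender[x]):
            if s == 0.0:
                continue  # zero-probability messages contribute nothing
            recon += s * math.log(receiver[m][x])
            kl += s * math.log(s / prior[m])
        total += px * (recon - beta * kl)
    return total


if __name__ == "__main__":
    p_x = [0.5, 0.5]
    identity = [[1.0, 0.0], [0.0, 1.0]]  # perfectly informative sender/receiver
    uniform = [0.5, 0.5]                 # uniform prior over two messages
    for beta in (0.0, 0.5, 1.0):
        print(beta, beta_elbo(p_x, identity, identity, uniform, beta))
```

In this toy setting a perfectly informative sender pays a KL cost of log 2 per object under a uniform prior, so increasing `beta` lowers the objective even though reconstruction is perfect. This is the informativeness-versus-cost tension that the choice of prior controls.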

Paper Structure

This paper contains 18 sections, 2 theorems, 45 equations, 5 figures, 1 table, and 1 algorithm.

Key Result

Lemma 1

The following equation holds: where $f_{\theta}:\mathcal{X}\times\mathcal{M}\to\mathbb{R}$ is any function differentiable w.r.t. $\theta$.

Figures (5)

  • Figure 1: Illustration of similarity between signaling games and (beta-)VAE.
  • Figure 2: Results for $n_{\textrm{bou}}$ (A1), $n_{\textrm{seg}}$ (A2), $\Delta_{\textrm{w},\textrm{c}}$ (A3), C-TopSim (A3), and W-TopSim (A3), shown in order from the left. The x-axis represents $(n_{\textrm{att}},n_{\textrm{val}})$, and the y-axis represents the value of each metric. The shaded regions and error bars represent the standard error of the mean. The threshold parameter is set to $0$. The blue plots show the results for our ELBO-based objective $\mathcal{J}_{\textrm{ours}}$, the orange ones for the baseline (BL: conventional) using the conventional objective $\mathcal{J}_{\textrm{conv}}$ plus the entropy regularizer, and the grey ones for the baseline (BL: priorExp) using the ELBO-based objective whose prior is $P^{\textrm{prior}}_{\alpha}$. The apparently inferior $\Delta_{\textrm{w},\textrm{c}}$ of $\mathcal{J}_{\textrm{ours}}$ compared to the baselines might be misleading: $\mathcal{J}_{\textrm{ours}}$ greatly improves both C-TopSim and W-TopSim, and the larger scale of these improvements could result in a seemingly worse $\Delta_{\textrm{w},\textrm{c}}$, which does not necessarily indicate poorer performance.
  • Figure 3: Results for $n_{\textrm{bou}}$ (A1), $n_{\textrm{seg}}$ (A2), $\Delta_{\textrm{w},\textrm{c}}$ (A3), C-TopSim (A3), and W-TopSim (A3), shown in order from the left. The x-axis represents $(n_{\textrm{att}},n_{\textrm{val}})$, and the y-axis represents the value of each metric. The shaded regions and error bars represent the standard error of the mean. The threshold parameter is set to $0.25$. The blue plots show the results for our ELBO-based objective $\mathcal{J}_{\textrm{ours}}$, the orange ones for the baseline (BL: conventional) using the conventional objective $\mathcal{J}_{\textrm{conv}}$ plus the entropy regularizer, and the grey ones for the baseline (BL: priorExp) using the ELBO-based objective whose prior is $P^{\textrm{prior}}_{\alpha}$. The apparently inferior $\Delta_{\textrm{w},\textrm{c}}$ of $\mathcal{J}_{\textrm{ours}}$ compared to the baselines might be misleading: $\mathcal{J}_{\textrm{ours}}$ greatly improves both C-TopSim and W-TopSim, and the larger scale of these improvements could result in a seemingly worse $\Delta_{\textrm{w},\textrm{c}}$, which does not necessarily indicate poorer performance.
  • Figure 4: Results for $n_{\textrm{bou}}$ (A1), $n_{\textrm{seg}}$ (A2), $\Delta_{\textrm{w},\textrm{c}}$ (A3), C-TopSim (A3), and W-TopSim (A3), shown in order from the left. The x-axis represents $(n_{\textrm{att}},n_{\textrm{val}})$, and the y-axis represents the value of each metric. The shaded regions and error bars represent the standard error of the mean. The threshold parameter is set to $0.25$. The blue plots show the results for our ELBO-based objective $\mathcal{J}_{\textrm{ours}}$, the orange ones for the baseline (BL: conventional) using the conventional objective $\mathcal{J}_{\textrm{conv}}$ plus the entropy regularizer, and the grey ones for the baseline (BL: priorExp) using the ELBO-based objective whose prior is $P^{\textrm{prior}}_{\alpha}$. The apparently inferior $\Delta_{\textrm{w},\textrm{c}}$ of $\mathcal{J}_{\textrm{ours}}$ compared to the baselines might be misleading: $\mathcal{J}_{\textrm{ours}}$ greatly improves both C-TopSim and W-TopSim, and the larger scale of these improvements could result in a seemingly worse $\Delta_{\textrm{w},\textrm{c}}$, which does not necessarily indicate poorer performance.
  • Figure 5: Mean message length sorted by objects' frequency across 32 random seeds. A moving average with a window size of 10 is shown for readability.
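
The kind of curve described for Figure 5 (mean message length ordered by object frequency, then smoothed) could be computed roughly as follows. This is only a sketch under stated assumptions: the helper names `lengths_by_frequency` and `moving_average` are hypothetical, the input lists are assumed to hold one frequency and one mean message length per object, and the handling of window edges (here, only full windows are kept) is a guess rather than the authors' procedure.

```python
def lengths_by_frequency(freqs, mean_lengths):
    """Return mean message lengths ordered from most to least frequent object.

    freqs[i] and mean_lengths[i] are assumed to describe the same object i.
    """
    if len(freqs) != len(mean_lengths):
        raise ValueError("freqs and mean_lengths must have the same length")
    order = sorted(range(len(freqs)), key=lambda i: -freqs[i])
    return [mean_lengths[i] for i in order]


def moving_average(values, window=10):
    """Simple moving average that keeps only full windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide the window by one position: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        out.append(running / window)
    return out


if __name__ == "__main__":
    freqs = [1, 5, 3, 8]               # made-up object frequencies
    lengths = [9.0, 2.0, 4.0, 1.0]     # made-up mean message lengths
    ordered = lengths_by_frequency(freqs, lengths)
    print(ordered)
    print(moving_average(ordered, window=2))
```

Under Zipf's law of abbreviation, the smoothed curve would be expected to rise from left to right, since more frequent objects should receive shorter messages.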

Theorems & Definitions (6)

  • Remark 1
  • Lemma 1
  • Remark 2
  • Lemma 2
  • Remark 3: Curious Case on Length Penalty
  • Remark 4