SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication

Nguyen Le Hoang, Tadahiro Taniguchi, Fang Tianwei, Akira Taniguchi

TL;DR

This work proposes SimSiam+VAE, a unified approach to representation learning and emergent communication that outperforms both SimSiam and VI-SimSiam. It extends the model into the SimSiam Naming Game (SSNG), which applies the generative and Bayesian approach based on VI to develop internal representations and emergent language.

Abstract

Emergent communication, driven by generative models, enables agents to develop a shared language for describing their individual views of the same objects through interactions. Meanwhile, self-supervised learning (SSL), particularly SimSiam, uses discriminative representation learning to make representations of augmented views of the same data point closer in the representation space. Building on the prior work of VI-SimSiam, which incorporates a generative and Bayesian perspective into the SimSiam framework via variational inference (VI) interpretation, we propose SimSiam+VAE, a unified approach for both representation learning and emergent communication. SimSiam+VAE integrates a variational autoencoder (VAE) into the predictor of the SimSiam network to enhance representation learning and capture uncertainty. Experimental results show that SimSiam+VAE outperforms both SimSiam and VI-SimSiam. We further extend this model into a communication framework called the SimSiam Naming Game (SSNG), which applies the generative and Bayesian approach based on VI to develop internal representations and emergent language, while utilizing the discriminative process of SimSiam to facilitate mutual understanding between agents. In experiments with established models, despite the dynamic alternation of agent roles during interactions, SSNG demonstrates comparable performance to the referential game and slightly outperforms the Metropolis-Hastings naming game.

Paper Structure

This paper contains 21 sections, 31 equations, 4 figures, 3 tables, 2 algorithms.

Figures (4)

  • Figure 1: Illustrations of SSL interpreted as a form of VI. (a) The PGM representation of the inference process in SSL. Observations $x_A$ and $x_B$ represent two augmented views (considered as multimodal observations) of the same data sample, derived from a dataset $D$. The arrows point from $x_A$ and $x_B$ to the latent variable $z$, indicating that the augmented views share a common latent representation $z$, which is inferred from these observations. (b) The SimSiam framework [2021_chen_simsiam]. Two augmented views, $x_A$ and $x_B$, are processed through a shared backbone and projector network $f$ to produce latent representations $z_A$ and $z_B$. A predictor network $h$ generates a transformed representation $z'_A$, which is compared to $z_B$ using a similarity measure. A stop-gradient operation is applied to $z_B$ so that no gradient flows back through that branch, ensuring stable training and avoiding model collapse. (c) The VI-SimSiam framework [2023_nakamura_simsiam_vi] extends SimSiam by modeling representation uncertainty. Latent representations $z_A$ and $z_B$ are produced similarly, but two predictors output the mean direction $\mu$ and concentration parameter $\kappa$ of the power spherical distribution, enabling both the representation and its uncertainty to be modeled.
  • Figure 2: Illustrations of SimSiam+VAE. (a) The PGM representation of the generative and inference process in SimSiam+VAE. From the observations $x_A$ and $x_B$, the representation $z$ is inferred, which is subsequently used to infer the latent variable $w$. Solid lines indicate the generative process (from $w$ to $z$), while dashed lines indicate the inference process (from $x_A$ and $x_B$ to $z$ and then to $w$). (b) Architecture of the SimSiam+VAE framework. Two augmented views, $x_A$ and $x_B$, are processed through a shared backbone and projector network $f$ to produce representations $z_A$ and $z_B$. The predictor $h$ incorporates VAE components: the predictor encoder outputs the mean $\mu_A$ and covariance $\Sigma_A$ of the distribution over the latent variable $w_A$. The predictor decoder reconstructs the representation $z'_A$ from $w_A$. The similarity between $z'_A$ and $z_B$ is measured, and a stop-gradient operation is applied to $z_B$ to prevent collapse.
  • Figure 3: Emergent communication (EmCom) between two agents, A and B, based on the SimSiam Naming Game. (a) Two agents observe the same object from different perspectives. Each agent maps its observations to internal representations and uses them to infer and predict emergent language symbols, enabling them to communicate their perceptions and develop a shared emergent language. (b) The PGM of SSNG, where $* \in \{ A, B \}$ denotes an agent. Solid lines represent the generative process, which runs from the shared latent variable $w$ to the representation $z_*$. Dashed lines represent the inference process, where each agent infers its representation $z_*$ from its observation $x_*$, and the shared message $w$ is inferred jointly from both agents' internal representations $z_A$ and $z_B$. (c) The structure of the agents: both agents have the same model architecture, with a backbone and projector $f_*$ and a predictor $h_*$ that acts as the language coder, consisting of an encoder $h_*^{\text{(enc)}}$ and a decoder $h_*^{\text{(dec)}}$. In the example shown, agent A (depicted as the speaker) generates and transmits a message $w_A$ to agent B (the listener), who processes it through a predictor decoder, producing an internal representation $z'_A$, which is then compared to $z_B$ to measure their similarity.
  • Figure 4: Architecture of the language coder comprising an LSTM-based encoder-decoder structure. The LSTM Encoder generates a sequence of discrete tokens to form the message $w = (m_1, m_2, \ldots, m_N)$ from a representation $z$ in an autoregressive manner, where each token is sampled from a discrete distribution over the vocabulary $V$. The LSTM Decoder then reconstructs a new representation $z'$ from the sequence of token embeddings, with its final hidden state serving as the output.
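
The data flow described for Figure 2(b) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the similarity measure is taken to be negative cosine similarity (as in the original SimSiam), the covariance $\Sigma_A$ is taken to be diagonal and parameterized by a log-variance, the prior over $w_A$ is taken to be a standard normal, and the weight `beta` and all function names are hypothetical. Without an autograd framework the stop-gradient cannot be shown, so the code simply treats $z_B$ as a constant target.

```python
import math
import random


def neg_cosine(z_prime_a, z_b):
    """Negative cosine similarity between the prediction z'_A and the target z_B.

    In SimSiam the target z_B sits behind a stop-gradient; here it is just
    a plain list of floats, i.e. a constant.
    """
    dot = sum(a * b for a, b in zip(z_prime_a, z_b))
    norm_a = math.sqrt(sum(a * a for a in z_prime_a))
    norm_b = math.sqrt(sum(b * b for b in z_b))
    return -dot / (norm_a * norm_b)


def reparameterize(mu, log_var, rng):
    """Sample w = mu + sigma * eps with eps ~ N(0, I).

    Assumes a diagonal covariance with sigma = exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def simsiam_vae_loss(z_prime_a, z_b, mu_a, log_var_a, beta=1.0):
    """Similarity term plus a beta-weighted KL regularizer (assumed form)."""
    return neg_cosine(z_prime_a, z_b) + beta * kl_to_standard_normal(mu_a, log_var_a)


if __name__ == "__main__":
    rng = random.Random(0)
    mu_a, log_var_a = [0.2, -0.1], [-1.0, -1.0]   # hypothetical predictor-encoder outputs
    w_a = reparameterize(mu_a, log_var_a, rng)     # latent variable w_A
    z_prime_a = w_a                                # identity stands in for the predictor decoder
    z_b = [0.3, -0.2]                              # target representation from the other view
    print(simsiam_vae_loss(z_prime_a, z_b, mu_a, log_var_a))
```

In a real model the predictor encoder and decoder would be learned networks, and the loss would typically be symmetrized by swapping the roles of the two views.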

Theorems & Definitions (2)

  • proof
  • proof