Referential communication in heterogeneous communities of pre-trained visual deep networks

Matéo Mahaut; Francesca Franzon; Roberto Dessì; Marco Baroni

TL;DR

The paper addresses cross-architecture communication among pre-trained vision networks by introducing a light, trainable communication layer that enables a shared referential protocol to emerge in a self-supervised setting. It demonstrates strong referential accuracy across homogeneous, heterogeneous, and population training conditions, and shows that the protocol generalizes to unseen object categories and datasets, with $64$-dimensional messages typically outperforming $16$-dimensional ones. A new agent can rapidly learn the established protocol, indicating potential for a universal, transferable protocol across models. Analyses using Gaussian-blur perturbations and Sparse Autoencoders suggest the protocol encodes high-level semantic features rather than relying on low-level image details, underscoring its functional and interpretive value for cross-model communication.

Abstract

As large pre-trained image-processing neural networks are being embedded in autonomous agents such as self-driving cars or robots, the question arises of how such systems can communicate with each other about the surrounding world, despite their different architectures and training regimes. As a first step in this direction, we systematically explore the task of referential communication in a community of heterogeneous state-of-the-art pre-trained visual networks, showing that they can develop, in a self-supervised way, a shared protocol to refer to a target object among a set of candidates. This shared protocol can also be used, to some extent, to communicate about previously unseen object categories of different granularity. Moreover, a visual network that was not initially part of an existing community can learn the community's protocol with remarkable ease. Finally, we study, both qualitatively and quantitatively, the properties of the emergent protocol, providing some evidence that it is capturing high-level semantic features of objects.

Paper Structure (40 sections, 1 equation, 17 figures, 23 tables)

Introduction
Related work
Deep net emergent communication
Representation similarity, model stitching and multimodal representation learning
Self-supervised learning for image classification
Multi-agent systems
Setup
The referential communication game
Agent architectures and training
Pre-trained vision modules
Trainable communication components
Training
Datasets
Experiments
Referential communication of homogeneous and heterogeneous networks
...and 25 more sections

Figures (17)

Figure 1: Referential game setup and agent architectures. A target image is input to the sender, which extracts a vector representation of it by passing it through a pre-trained frozen visual network. This vector representation is fed to the feed-forward Communication Module, which generates a message consisting of another continuous vector; the message is passed as one of the inputs to the receiver. The receiver also processes each candidate image it gets as input by passing it through a pre-trained frozen visual network (whose architecture can differ from the sender's), obtaining a set of vector representations. These are fed to the receiver's Communication Module, another feed-forward component that maps them to vectors in the same space as the sender's message embedding. The Selection Module of the receiver simply consists of a parameter-free cosine similarity computation between the message and each image representation, followed by softmax normalization. The receiver is said to have correctly identified the target if the largest value in the resulting probability distribution corresponds to the index of the target in the candidate array. Note that no parameters are shared between sender and receiver (except those of the frozen visual modules when the two agents use the same visual architecture).
Figure 2: Test accuracy and learning speed of learner agents (left: 16-dimensional messages; right: 64-dimensional messages). Blue line: learning curve on test data for learner agent added to a communicating pair, averaged across all possible heterogeneous triples. Orange line: learning curve for a learner agent added to an existing community, averaged across all possible leave-one-out cases. Vertical bars indicate standard deviation across cases. As a baseline for learning speed, the dashed green line shows the learning curve when training the whole 7x7 population at once from scratch.
Figure 3: Top: ImageNet1k images whose messages are closest to a chosen one-hot vector in SAE space. Bottom left: CelebA images whose messages are closest to the same sparse dimension. Bottom right: the 5 closest Places205 images. No image from CelebA or Places205 is seen at training time by the sender, the receiver, or the SAE.
Figure 4: Left: accuracy after various image perturbations for the one-to-one and population setups. Gaussian blur was uniformly sampled within the [0.1, 10] range. Mean accuracies and corresponding standard deviations are given across all 36 possible agent pairs. We add a small horizontal offset to the accuracy values for different perturbations to make them all visible. The horizontal lines connecting one-to-one and population accuracies for the same perturbation are meant to ease comparison between the two settings. The random baseline is 1.6% (randomly selecting one of the 64 images). Right: perturbation examples.
Figure 5: Distributions of cosine distances between messages sent by two different senders for the same ImageNet1k training set images. The different senders are either trained in the heterogeneous one-to-one setup (blue) or in the same population (orange). As a baseline, the green boxplot shows distances between the messages produced for different input images by population-trained VGG 11 and ViT-B/16 senders.
...and 12 more figures
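
The receiver's Selection Module described in the Figure 1 caption (cosine similarity between the message and each candidate representation, followed by softmax, then picking the largest probability) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `select_target`, the random stand-in vectors, and the noise level are all assumptions made for illustration, and the candidate vectors here stand in for the outputs of the receiver's Communication Module. The last line reproduces the chance level quoted in the Figure 4 caption for 64 candidates.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_target(message, candidates):
    """Parameter-free selection: cosine similarities, softmax, argmax."""
    sims = [cosine(message, c) for c in candidates]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


# Hypothetical stand-in data: 64 candidates in a 16-dimensional message space.
random.seed(0)
n_candidates, msg_dim = 64, 16
candidates = [[random.gauss(0, 1) for _ in range(msg_dim)]
              for _ in range(n_candidates)]

# Assumption: a well-trained sender emits a message close to the target's
# embedding in the receiver's space, simulated here with small added noise.
target_index = 5
message = [x + 0.01 * random.gauss(0, 1) for x in candidates[target_index]]

predicted, probs = select_target(message, candidates)
chance_accuracy = 1 / n_candidates  # 1/64 = 0.015625, i.e. about 1.6%
```

Because the selection step has no trainable parameters, all learning in this sketch would have to happen in the components that produce `message` and `candidates`, which matches the caption's description of the two Communication Modules as the trainable parts.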
