Semi-supervised Multimodal Representation Learning through a Global Workspace

Benjamin Devillers, Léopold Maytié, Rufin VanRullen

TL;DR

The paper addresses how to learn grounded, cross-modal representations from limited paired data by leveraging a Global Workspace–inspired shared latent space. It integrates translation, contrastive alignment, and cycle-consistency objectives within a semi-supervised framework, using pretrained unimodal modules connected via a fixed-capacity workspace. Empirical results on synthetic (Simple Shapes, Factory) and real-world (COCO) datasets show that a GW-based model with semi-supervision achieves efficient vision–language translation and robust cross-modal alignment, improving downstream transfer tasks and retrieval performance. The work suggests that GW-inspired architectures can approximate human-like frugal multimodal learning and sets the stage for extending to more modalities and temporal dynamics.

Abstract

Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations, or to translate signals from one domain to another (as in image captioning, or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and other animals) can learn useful multimodal representations from only sparse experience with matched cross-modal data. Here we evaluate the capabilities of a neural network architecture inspired by the cognitive notion of a "Global Workspace": a shared representation for two (or more) input modalities. Each modality is processed by a specialized system (pretrained on unimodal data, and subsequently frozen). The corresponding latent representations are then encoded to and decoded from a single shared workspace. Importantly, this architecture is amenable to self-supervised training via cycle-consistency: encoding-decoding sequences should approximate the identity function. For various pairings of vision-language modalities and across two datasets of varying complexity, we show that such an architecture can be trained to align and translate between two modalities with very little need for matched data (from 4 to 7 times less than a fully supervised approach). The global workspace representation can be used advantageously for downstream classification tasks and for robust transfer learning. Ablation studies reveal that both the shared workspace and the self-supervised cycle-consistency training are critical to the system's performance.
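
The cycle-consistency idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the latent sizes, the random linear maps standing in for the workspace encoders and decoders, and the helper names (`linear`, `mse`, `demi_cycle_loss`, `cycle_loss`, `translation_loss`) are all assumptions made for illustration. Inputs stand for latent vectors produced by the frozen unimodal modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent sizes: vision latent, text latent, shared workspace.
D_V, D_T, D_GW = 12, 8, 6


def linear(d_in, d_out):
    """Random linear map standing in for a trainable encoder or decoder."""
    W = rng.normal(scale=0.1, size=(d_in, d_out))
    return lambda x: x @ W


# Workspace encoders (e_v, e_t) and decoders (d_v, d_t).
e_v, e_t = linear(D_V, D_GW), linear(D_T, D_GW)
d_v, d_t = linear(D_GW, D_V), linear(D_GW, D_T)


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


def demi_cycle_loss(v):
    """v -> GW -> v: encode, then decode within the same modality."""
    return mse(d_v(e_v(v)), v)


def cycle_loss(v):
    """v -> GW -> t -> GW -> v: translate to text and back again."""
    t_hat = d_t(e_v(v))
    return mse(d_v(e_t(t_hat)), v)


def translation_loss(v, t):
    """Supervised term: needs a matched (v, t) pair."""
    return mse(d_t(e_v(v)), t)


# Unpaired vision latents are enough for the two self-supervised terms;
# only the translation term consumes matched pairs.
v_unpaired = rng.normal(size=(32, D_V))
v_pair, t_pair = rng.normal(size=(4, D_V)), rng.normal(size=(4, D_T))

losses = {
    "demi_cycle": demi_cycle_loss(v_unpaired),
    "cycle": cycle_loss(v_unpaired),
    "translation": translation_loss(v_pair, t_pair),
}
print(losses)
```

In this sketch the cycle and demi-cycle terms are computed on 32 unpaired samples while the translation term uses only 4 matched pairs, which mirrors the semi-supervised setting: the self-supervised terms can draw on data for which no cross-modal match exists.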

Paper Structure (51 sections, 16 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Panel A: generic bimodal network. Inputs can come from two modalities $x$ and $y$ (for instance, visual images and text captions). $e_v$ and $e_t$ are feed-forward neural networks that project each modality into a latent space; $d_v$ and $d_t$ are decoders that map the latent space back into the respective modality. (Note that this generic model serves only to clarify our definitions; it does not yet correspond to our GW architecture.) Panel B: illustration of the primary and secondary desirable properties for multimodal systems. Each arrow shows a learned path that converts one latent vector into another. For instance, in $P_{dcy}$ a latent vector is converted from one domain back to itself via the central representation. Note that the four properties are not independent but can be causally related, as we describe in relations $R_1$ to $R_4$.
  • Figure 2: Examples from the Simple Shapes dataset. Each image contains a unique object of differing shape, color, rotation, size, and position. The image is paired with a natural language sentence describing the attributes.
  • Figure 3: Examples from the Factory dataset. Each image is taken from a fixed point of view; a table is randomly positioned in the environment, while the other objects (robot, cones, crates, conveyer belt...) remain in a fixed position. Each image can be associated with a "proto-language" description of the table's attributes (position, orientation, color), or with a natural language English description (as shown on the right).
  • Figure 4: Diagram of our Global Workspace architecture. Specialist modules for vision and language have a blue background. We use each modality's encoder ($e_v$ or $e_t$) to project data samples into a common latent space (GW), and the corresponding decoders ($d_t$ or $d_v$) to translate GW activations into the input domains. In this figure, we assume that the model satisfies property $P_\text{cont}$, so that the GW representation is shared across modalities.
  • Figure 5: The left panel evaluates the primary properties (translation and contrastive alignment); the right panel assesses the secondary properties (cycle and demi-cycle consistency). Each point in each graph is a different model trained until convergence, using a particular number of matched bimodal examples $N$ (x-axis). Dashed lines correspond to GW models, and curves with markers are semi-supervised models. $P_\text{tr}\ \&\ P_\text{dcy}$ was included as a way to assess relation $R_3$. The first and second rows on the left display the test translation and contrastive losses of the selected models, respectively. The first and second rows on the right show the test cycle and demi-cycle losses, respectively. Columns refer to different language modalities (proto-language or natural language). (The vertical gray line in the leftmost column marks the chosen value of $N$ that will be used later to assess the influence of the total number of unsupervised data samples.)
  • ...and 3 more figures
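
In the notation of Figures 1 and 4, the four properties can be written as losses over paths through the workspace. The forms below are a sketch under stated assumptions, not the paper's own equations, which are not reproduced in this listing: $v$ and $t$ denote the frozen unimodal latent vectors for vision and text, squared error is assumed for the reconstruction terms, an InfoNCE-style form with temperature $\tau$ is assumed for the contrastive term, and the subscripts $\text{cy}$ and $\text{cont}$ are illustrative names chosen to match $P_\text{tr}$ and $P_\text{dcy}$.

$$
\begin{aligned}
\mathcal{L}_\text{tr} &= \lVert d_t(e_v(v)) - t \rVert^2 + \lVert d_v(e_t(t)) - v \rVert^2 \\
\mathcal{L}_\text{cont} &= -\log \frac{\exp\left(\langle e_v(v_i), e_t(t_i) \rangle / \tau\right)}{\sum_j \exp\left(\langle e_v(v_i), e_t(t_j) \rangle / \tau\right)} \\
\mathcal{L}_\text{dcy} &= \lVert d_v(e_v(v)) - v \rVert^2 + \lVert d_t(e_t(t)) - t \rVert^2 \\
\mathcal{L}_\text{cy} &= \lVert d_v(e_t(d_t(e_v(v)))) - v \rVert^2 + \lVert d_t(e_v(d_v(e_t(t)))) - t \rVert^2
\end{aligned}
$$

Under these assumed forms, $\mathcal{L}_\text{tr}$ and $\mathcal{L}_\text{cont}$ compare a vision latent with its matched text latent and therefore need paired data, whereas every term in $\mathcal{L}_\text{dcy}$ and $\mathcal{L}_\text{cy}$ starts and ends in the same modality and can be evaluated on unimodal samples alone.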