An Attention Mechanism for Robust Multimodal Integration in a Global Workspace Architecture

Roland Bertin-Johannet, Lara Scipio, Leopold Maytié, Rufin VanRullen

TL;DR

The paper tackles robustness in multimodal fusion by introducing a Global Workspace (GW) architecture augmented with a top-down modality attention mechanism. By freezing a pretrained multimodal workspace and adding a lightweight attention controller, the approach learns to re-weight modalities under varying reliability without retraining the entire system, using a set-to-set broadcast formulation and a mix of translation, demi-cycle, cycle, and contrastive objectives. Empirical results on Simple Shapes and MM-IMDb 1.0 demonstrate improved noise robustness, strong cross-task generalization, and competitive performance on a real-world benchmark with favorable training efficiency. The work advances practical multimodal AI by enabling flexible, transferable modality selection within a GW framework, with potential extensions to dynamic data and additional modalities.

Abstract

Global Workspace Theory (GWT), inspired by cognitive neuroscience, posits that flexible cognition could arise via the attentional selection of a relevant subset of modalities within a multimodal integration system. This cognitive framework can inspire novel computational architectures for multimodal integration. Indeed, recent implementations of GWT have explored its multimodal representation capabilities, but the related attention mechanisms remain understudied. Here, we propose and evaluate a top-down attention mechanism to select modalities inside a global workspace. First, we demonstrate that our attention mechanism improves noise robustness of a global workspace system on two multimodal datasets of increasing complexity: Simple Shapes and MM-IMDb 1.0. Second, we highlight various cross-task and cross-modality generalization capabilities that are not shared by multimodal attention models from the literature. Comparing against existing baselines on the MM-IMDb 1.0 benchmark, we find our attention mechanism makes the global workspace competitive with the state of the art.

Paper Structure (18 sections, 13 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Proposed architecture (illustrated for the Simple Shapes dataset). A: GW multimodal representation model. Reading from left to right, we start with raw data (images, text, attribute vectors), encoded through modality-specific backbones that are pretrained and frozen during GW learning. We project the modality-specific latents $X_i$ to amodal GW representations $g_i$ with MLP encoders $E_i$. These latent representations are then fused with a weighted average to produce a single vector $\mathbf{z}$ (the role of the attention mechanism is to set the weights $\alpha_i$). The GW representation $\mathbf{z}$ can then be broadcast back to the modality-specific latent spaces using GW decoders $D_i$. The GW encoders and decoders are pretrained with randomly chosen attention weights on each data sample. B: Our attention mechanism computes modality weights $\alpha_i$ from an initial fused GW latent $\mathbf{z}^{(init)}$. We start from the pre-fusion GW latents $g_i$ and form $\mathbf{z}^{(init)}$ by uniform fusion (equal weights) across modalities. A single shared Key matrix $K$ produces keys $k_i$ from each $g_i$, and a Query matrix $Q$ produces a query $q$ from $\mathbf{z}^{(init)}$. We then compute $\alpha_i$ via dot products $\langle q, k_i \rangle$ followed by a softmax across modalities. These weights define the final attention-weighted fusion that yields the GW latent $\mathbf{z}$.
  • Figure 2: Simple Shapes (top): accuracy heatmaps for GMU, DynMM, and our method across train noise $\sigma$ (columns) and test noise $\sigma$ (rows), averaged over tasks and corrupted-modality choices. MM-IMDb (bottom): macro-F1 heatmaps across train and test corruption levels, averaged over tasks and corruption types.
  • Figure 3: In- vs. out-of-distribution leave-out generalization on Simple Shapes and MM-IMDb. For each model, we report performance when the training task matches the evaluation task (in-distribution) versus when it differs (out-of-distribution). Results are averaged over 3 seeds.
  • Figure 4: Modality generalization test I: unseen clean modality. The systems are trained on all 5 classification tasks at once, with one of the 3 modalities always noised. For each model, the first bar represents accuracy under training conditions, and the second reflects test-time accuracy, where the left-out modality (the one always noised during training) can be the one presented without noise. The bars correspond to average performance across all 3 possible left-out modalities and all 5 tasks. For GMU and DynMM, we fine-tuned the classification probes on the new noise distribution before testing. Despite this fine-tuning, these systems could not generalize well to the new situation (hatched bars). For our model, no fine-tuning was necessary, since the probes are always trained on clean inputs. Even so, our model generalized well and outperformed the other baselines, but only when using a trained attention strategy, not with random modality scores.
  • Figure 5: Modality generalization test II: unseen modality. We train the attention system on two modalities and test it on three. Left: Our attention mechanism's performance vs. random fusion performance (no attention), under both train and evaluation configurations. The bars show accuracy averaged across the 3 possible left-out modalities, 5 tasks, and all combinations of noisy/clean modalities. Right: average attention scores given by our attention mechanism on the test dataset, as a function of the noised modality. We show results separately for each left-out modality, and notice that attention scores on the left-out modality are comparable to those on trained modalities: this means our attention mechanism perfectly generalizes to the left-out modality.
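
The attention step described in the Figure 1 caption (panel B) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the latent and key dimensions, and the random values standing in for the trained matrices $K$ and $Q$ are all hypothetical, and any score scaling or bias terms the paper may use are omitted because the caption does not mention them.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(latents, weights):
    """Weighted average of the per-modality GW latents g_i."""
    dim = len(latents[0])
    return [sum(w * g[d] for w, g in zip(weights, latents)) for d in range(dim)]


def modality_attention(latents, key_matrix, query_matrix):
    """Compute modality weights alpha_i and the fused latent z.

    latents: list of pre-fusion GW latents g_i (one per modality).
    key_matrix: shared matrix K applied to every g_i.
    query_matrix: matrix Q applied to the uniformly fused latent z_init.
    """
    n = len(latents)
    # Uniform fusion (equal weights) gives the initial latent z_init.
    z_init = weighted_fusion(latents, [1.0 / n] * n)
    query = matvec(query_matrix, z_init)
    # The same K produces one key per modality.
    keys = [matvec(key_matrix, g) for g in latents]
    # Dot products <q, k_i>, then a softmax across modalities.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    alphas = softmax(scores)
    # Final attention-weighted fusion.
    z = weighted_fusion(latents, alphas)
    return alphas, z


# Hypothetical sizes and random stand-ins for the trained parameters.
rng = random.Random(0)
latent_dim, key_dim, n_modalities = 4, 3, 3
latents = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(n_modalities)]
key_matrix = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(key_dim)]
query_matrix = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(key_dim)]

alphas, z = modality_attention(latents, key_matrix, query_matrix)
print("modality weights:", alphas)
print("fused latent:", z)
```

Because the keys come from a single shared matrix applied to each $g_i$, the same function accepts any number of modalities at call time, which may be part of why a mechanism of this shape can be evaluated with a modality that was absent during attention training (Figure 5). In this sketch, a zero query matrix makes all scores equal, so the weights fall back to uniform fusion.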