GATS: Gather-Attend-Scatter

Konrad Zolna, Serkan Cabi, Yutian Chen, Eric Lau, Claudio Fantacci, Jurgis Pasukonis, Jost Tobias Springenberg, Sergio Gomez Colmenarejo

TL;DR

GATS is a general-purpose module for unifying pretrained foundation models across modalities: it gathers activations from the component models, applies cross-modal attention, and scatters the updates back into those models without fine-tuning them. The module interleaves with vanilla transformers, enabling cross-modal conditioning through a proportional layer grafting strategy and a steering mechanism that controls which models are actively updated. Demonstrations include Atari Pong, Language-Table, and YCB robotic tasks with frozen language and vision models, plus a bimodal text-image model showing effective captioning and generation with shared GATS parameters. This approach offers a scalable, training-efficient path to building multimodal agents and generators by composing diverse, frozen pretrained models.

Abstract

As the AI community increasingly adopts large-scale models, it is crucial to develop general and flexible tools to integrate them. We introduce Gather-Attend-Scatter (GATS), a novel module that enables seamless combination of pretrained foundation models, both trainable and frozen, into larger multimodal networks. GATS empowers AI systems to process and generate information across multiple modalities at different rates. In contrast to traditional fine-tuning, GATS allows for the original component models to remain frozen, avoiding the risk of them losing important knowledge acquired during the pretraining phase. We demonstrate the utility and versatility of GATS with a few experiments across games, robotics, and multimodal input-output systems.

Paper Structure

This paper contains 51 sections, 5 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: By leveraging GATS, a modular multimodal architecture can be constructed, integrating frozen pretrained foundation models.
  • Figure 2: Top. GATS can process sequences of activations from different models, and hence with different sizes. Each element's color corresponds to its modality. The same-colored embeddings have the same size. Bottom. Two examples. The GATS module has separate local context lengths for each modality. The visualisations show which embeddings are visible to the GATS module when the bold embedding is processed, assuming that each modality has the same context length of 2 (grayed embeddings are ignored by GATS and do not take part in its computations).
  • Figure 3: GATS interleaves with the red unimodal transformer, which processes only red embeddings. GATS gathers two recent embeddings from each unimodal network, projects them to the common size, attends over them and scatters the output. The next layer of the red transformer processes activations altered by GATS instead of the original ones.
  • Figure 4: A special case of a GATS-based architecture that acts as a typical vision-to-text cross-attention model. Vision features (two red embeddings) are obtained at the beginning and are always visible to GATS, as its vision context is set to two. The GATS language context is one and hence only the most recent language token is gathered. GATS steers the language model by updating the language activations and effectively conditions language processing on vision. Bold tokens indicate those processed (gathered, attended or scattered) by GATS when the most recent language token is processed. The figure shows an example with a single image, but interleaving text and images is a straightforward extension.
  • Figure 5: This figure showcases a GATS-based agent architecture leveraging pretrained language and video models. For clarity, we use small models with 6, 4, and 2 layers for language, video, and action respectively, and a single layer for the GATS model. Colors represent different modalities: blue for language, red for vision, and green for action. The workflow begins with a single text instruction processed once (this step has to be repeated if another instruction is given). Then, with each environment step, a video frame and proprioception inputs are fed into their respective unimodal models and are processed alongside previous tokens from the same modality (i.e., the unimodal models have contexts long enough to fit more than a single environment step). GATS gathers activations from the language model's 3rd layer, the video model's 2nd layer, and the action model's 1st layer, using them to steer further processing for the two latter models. Importantly, GATS only attends to recent activations, delegating long-term processing to the individual models, resulting in negligible computational overhead. This design allows for seamless scaling of both model size and GATS layers, making the architecture highly flexible and adaptable. Bold tokens indicate those processed (gathered, attended or scattered) by GATS for the most recent environment step, highlighting the efficient interaction between GATS and the unimodal models.
  • ...and 7 more figures
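
The gather, project, attend and scatter steps described in the captions of Figures 2 and 3 can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a minimal illustration under several assumptions: the names `gather_recent`, `gats_step`, `proj_in` and `proj_out` are hypothetical, the projections are random rather than learned, attention is single-head with no separate query/key/value weights, the newest embedding counts toward its own modality's context, and the output is scattered back with a residual addition.

```python
import numpy as np

def gather_recent(modalities, context):
    """Indices of the most recent `context[m]` embeddings of each modality m."""
    counts = {}
    picked = []
    # Walk backwards in time so the newest embeddings are picked first.
    for idx in range(len(modalities) - 1, -1, -1):
        m = modalities[idx]
        if counts.get(m, 0) < context.get(m, 0):
            counts[m] = counts.get(m, 0) + 1
            picked.append(idx)
    return sorted(picked)

def gats_step(modalities, embeddings, context, proj_in, proj_out):
    """Update the newest embedding from its gathered local context (assumed form)."""
    idx = gather_recent(modalities, context)
    # Gather: project every visible embedding to the common width.
    x = np.stack([embeddings[i] @ proj_in[modalities[i]] for i in idx])
    # Attend: the newest embedding acts as the query over the gathered set.
    query = embeddings[-1] @ proj_in[modalities[-1]]
    scores = x @ query / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    attended = weights @ x
    # Scatter: project back to the newest embedding's own width (residual add).
    updated = embeddings[-1] + attended @ proj_out[modalities[-1]]
    return updated, idx

# Toy setup: two modalities with different embedding sizes (hypothetical values).
rng = np.random.default_rng(0)
sizes = {"red": 4, "blue": 6}
d_common = 8
proj_in = {m: rng.normal(size=(d, d_common)) for m, d in sizes.items()}
proj_out = {m: rng.normal(size=(d_common, d)) for m, d in sizes.items()}

modalities = ["red", "red", "blue", "red", "blue", "blue", "red"]
embeddings = [rng.normal(size=sizes[m]) for m in modalities]
context = {"red": 2, "blue": 2}  # per-modality local context length

updated, visible = gats_step(modalities, embeddings, context, proj_in, proj_out)
print(visible)        # [3, 4, 5, 6]
print(updated.shape)  # (4,)
```

With a context length of 2 per modality, only the two newest red and two newest blue embeddings are visible when the final red embedding is processed; the older ones are ignored, which mirrors the grayed-out embeddings in Figure 2. The updated vector keeps the red modality's original width, so the next layer of the corresponding unimodal model could consume it unchanged.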