GATS: Gather-Attend-Scatter
Konrad Zolna, Serkan Cabi, Yutian Chen, Eric Lau, Claudio Fantacci, Jurgis Pasukonis, Jost Tobias Springenberg, Sergio Gomez Colmenarejo
TL;DR
GATS is a general-purpose module for combining pretrained foundation models across modalities: it gathers activations from the component models, applies cross-modal attention to them, and scatters the resulting updates back into the component models without fine-tuning them. The module interleaves with vanilla transformers, enabling cross-modal conditioning through a proportional layer grafting strategy, together with a steering mechanism that controls which models are actively updated. Demonstrations include Atari Pong, Language-Table, and YCB robotic tasks with frozen language and vision models, plus a bimodal text-image model that handles both captioning and generation with shared GATS parameters. The approach offers a scalable, training-efficient path to building multimodal agents and generators by composing diverse, frozen pretrained models.
Abstract
As the AI community increasingly adopts large-scale models, it is crucial to develop general and flexible tools to integrate them. We introduce Gather-Attend-Scatter (GATS), a novel module that enables seamless combination of pretrained foundation models, both trainable and frozen, into larger multimodal networks. GATS empowers AI systems to process and generate information across multiple modalities at different rates. In contrast to traditional fine-tuning, GATS allows the original component models to remain frozen, avoiding the risk of them losing important knowledge acquired during the pretraining phase. We demonstrate the utility and versatility of GATS with experiments across games, robotics, and multimodal input-output systems.
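To make the gather, attend, and scatter steps concrete, the following is a minimal NumPy sketch of a single GATS-style layer operating on activations from two component models. It is an illustration under stated assumptions, not the authors' implementation: the function names (`init_params`, `gats_layer`), the per-model projection matrices, the single-head unmasked attention, the residual update, and the boolean `update` flags standing in for the steering mechanism are all hypothetical simplifications.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, widths, shared):
    # Hypothetical parameterization: one input and one output projection
    # per component model, plus shared attention weights.
    return {
        "proj_in": [rng.normal(size=(d, shared)) / np.sqrt(d) for d in widths],
        "proj_out": [rng.normal(size=(shared, d)) / np.sqrt(shared) for d in widths],
        "wq": rng.normal(size=(shared, shared)) / np.sqrt(shared),
        "wk": rng.normal(size=(shared, shared)) / np.sqrt(shared),
        "wv": rng.normal(size=(shared, shared)) / np.sqrt(shared),
    }


def gats_layer(acts, params, update):
    """acts: list of (T_i, d_i) activations, one per component model.
    update: list of bools; False leaves that model's activations untouched
    (a simplified stand-in for the steering mechanism)."""
    # Gather: project every model's activations to a shared width and
    # concatenate them along the sequence axis.
    gathered = np.concatenate(
        [a @ p for a, p in zip(acts, params["proj_in"])], axis=0
    )

    # Attend: single-head attention over all gathered tokens.
    # Assumption: no causal or local masking, for brevity.
    q = gathered @ params["wq"]
    k = gathered @ params["wk"]
    v = gathered @ params["wv"]
    attended = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

    # Scatter: split the attended tokens back per model, project to each
    # model's own width, and add the result residually.
    out, start = [], 0
    for a, p, upd in zip(acts, params["proj_out"], update):
        chunk = attended[start:start + a.shape[0]]
        start += a.shape[0]
        out.append(a + chunk @ p if upd else a)
    return out


rng = np.random.default_rng(0)
# Two component models with different sequence lengths and widths.
acts = [rng.normal(size=(5, 8)), rng.normal(size=(3, 12))]
params = init_params(rng, widths=[8, 12], shared=16)
out = gats_layer(acts, params, update=[True, False])
```

In this sketch the component models' own weights never appear: only the projection and attention parameters of the GATS layer would be trained, which reflects the idea of keeping the pretrained models frozen. The second model still contributes keys and values to the attention, so the first model is conditioned on it even though the second model's activations are returned unchanged.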
