T5Gemma 2: Seeing, Reading, and Understanding Longer

Biao Zhang; Paul Suganthan; Gaël Liu; Ilya Philippov; Sahil Dua; Ben Hora; Kat Black; Gus Martins; Omar Sanseviero; Shreya Pathak; Cassidy Hardin; Francesco Visin; Jiageng Zhang; Kathleen Kenealy; Qin Yin; Xiaodan Song; Olivier Lacombe; Armand Joulin; Tris Warkentin; Adam Roberts

TL;DR

The paper addresses the lack of long-context, multimodal encoder-decoder models by adapting the decoder-only Gemma 3 into T5Gemma 2 with the UL2 objective. It introduces two efficiency techniques, tied embeddings and merged attention, and evaluates three model sizes pretrained on ~2 trillion tokens, with vision inputs handled by a frozen SigLIP encoder; the models reach strong multimodal and long-context performance at context lengths up to 128K tokens. The results show that pretrained decoder-only models can be repurposed into encoder-decoder systems with comparable or better pretraining performance and improved post-training performance relative to Gemma 3, with checkpoints released openly. This suggests that encoder-decoder architectures offer tangible advantages for multimodal, long-context tasks, and it gives a practical foundation for downstream embedding and retrieval applications.

Abstract

We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) of T5Gemma, which adapts a pretrained decoder-only model into an encoder-decoder model, and extends it from the text-only regime to the multimodal one based on the Gemma 3 models. We further propose two methods to improve efficiency: tied word embedding, which shares all embeddings across encoder and decoder, and merged attention, which unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy across architectures and modalities, as well as the unique strength of the encoder-decoder architecture in long-context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly better post-training performance than its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.
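
To make the idea of merged attention more concrete, the following is a minimal sketch of one plausible reading of "a single joint module": each decoder query attends, in one softmax, over the concatenation of encoder and decoder keys, with full visibility of the encoder and causal visibility of the decoder. This is an illustration under stated assumptions, not the paper's implementation; the function names `merged_attention_mask` and `merged_attention` are hypothetical, and learned projections, multiple heads, and positional encodings are omitted.

```python
import math


def merged_attention_mask(enc_len, dec_len):
    """Rows are decoder queries; columns are [encoder keys | decoder keys].

    Assumption: every decoder position may see all encoder positions
    (the cross-attention part) and only decoder positions up to and
    including itself (the causal self-attention part).
    """
    return [
        [True] * enc_len + [j <= i for j in range(dec_len)]
        for i in range(dec_len)
    ]


def softmax(scores):
    # Numerically stable softmax; masked entries (-inf) get weight 0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merged_attention(queries, enc_kv, dec_kv):
    """One attention pass over concatenated encoder and decoder keys/values.

    queries: list of dec_len query vectors
    enc_kv, dec_kv: lists of (key, value) vector pairs
    """
    keys = [k for k, _ in enc_kv] + [k for k, _ in dec_kv]
    values = [v for _, v in enc_kv] + [v for _, v in dec_kv]
    mask = merged_attention_mask(len(enc_kv), len(dec_kv))
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])

    outputs = []
    for i, q in enumerate(queries):
        scores = [
            scale * sum(a * b for a, b in zip(q, k)) if allowed else float("-inf")
            for k, allowed in zip(keys, mask[i])
        ]
        weights = softmax(scores)  # a single softmax over both sources
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs
```

In this sketch the separate self-attention and cross-attention sublayers of a conventional decoder block collapse into one call, which is where a saving in parameters and computation would come from; whether the released models use exactly this masking layout is an assumption here.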
