TIPS: Text-Image Pretraining with Spatial awareness

Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, Andre Araujo

TL;DR

TIPS tackles the weakness of image-text pretraining for dense spatial tasks by fusing synthetic captions with noisy captions and combining CLIP-style contrastive learning with self-distillation and masked image modeling in a ViT-based encoder. A dual-embedding scheme leverages both synthetic and noisy textual supervision to yield spatially aware and object-centric embeddings, while a teacher-student framework and masking losses promote patch-level coherence. The approach scales to large ViT-g models and a curated 117M image-text dataset, achieving strong off-the-shelf performance across 8 tasks and 16 datasets, often surpassing baselines on dense tasks and matching or exceeding self-supervised methods on global tasks. The work provides a practical, frozen-feature multimodal backbone with broad applicability to dense prediction, retrieval, and zero-shot classification, and highlights the benefits of synthetic captions for spatial understanding.

Abstract

While image-text representation learning has become very popular in recent years, existing models tend to lack spatial awareness and have limited direct applicability for dense understanding tasks. For this reason, self-supervised image-only pretraining is still the go-to method for many dense vision applications (e.g. depth estimation, semantic segmentation), despite the lack of explicit supervisory signals. In this paper, we close this gap between image-text and self-supervised learning, by proposing a novel general-purpose image-text model, which can be effectively used off the shelf for dense and global vision tasks. Our method, which we refer to as Text-Image Pretraining with Spatial awareness (TIPS), leverages two simple and effective insights. First, on textual supervision: we reveal that replacing noisy web image captions by synthetically generated textual descriptions boosts dense understanding performance significantly, due to a much richer signal for learning spatially aware representations. We propose an adapted training method that combines noisy and synthetic captions, resulting in improvements across both dense and global understanding tasks. Second, on the learning technique: we propose to combine contrastive image-text learning with self-supervised masked image modeling, to encourage spatial coherence, unlocking substantial enhancements for downstream applications. Building on these two ideas, we scale our model using the transformer architecture, trained on a curated set of public images. Our experiments are conducted on 8 tasks involving 16 datasets in total, demonstrating strong off-the-shelf performance on both dense and global understanding, for several image-only and image-text tasks. Code and models are released at https://github.com/google-deepmind/tips.
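
The dual-caption contrastive objective summarized above (two image [CLS] embeddings, one aligned with the noisy web caption and one with the synthetic caption) can be illustrated with a minimal NumPy sketch. This is not the released implementation: the symmetric InfoNCE form, the temperature of 0.07, the equal weighting of the two terms, and the function names `clip_loss` and `dual_caption_loss` are all assumptions made for illustration. The self-distillation and masked image modeling terms are omitted.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, text) embeddings.

    img, txt: arrays of shape (B, D); row i of img is paired with row i of txt.
    The temperature value is an assumption, not taken from the paper.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B); matching pairs on the diagonal
    image_to_text = -np.diag(_log_softmax(logits, axis=1)).mean()
    text_to_image = -np.diag(_log_softmax(logits, axis=0)).mean()
    return 0.5 * (image_to_text + text_to_image)


def dual_caption_loss(cls_noisy, cls_synth, txt_noisy, txt_synth, temperature=0.07):
    """Sum of two contrastive terms, one per caption type.

    cls_noisy / cls_synth stand in for the two [CLS] embeddings produced by
    the image encoder; txt_noisy / txt_synth stand in for the text-encoder
    embeddings of the noisy and synthetic captions. Equal weighting of the
    two terms is an assumption.
    """
    return (clip_loss(cls_noisy, txt_noisy, temperature)
            + clip_loss(cls_synth, txt_synth, temperature))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim = 8, 16
    # Random stand-ins for encoder outputs; real inputs would come from the model.
    cls_a, cls_b = rng.normal(size=(2, batch, dim))
    txt_a, txt_b = rng.normal(size=(2, batch, dim))
    print(dual_caption_loss(cls_a, cls_b, txt_a, txt_b))
```

Keeping a separate [CLS] embedding per caption type lets each contrastive term shape its own global embedding, rather than forcing a single embedding to match two captions with very different content.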

Paper Structure

This paper contains 21 sections, 2 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: We introduce TIPS: Text-Image Pretraining with Spatial awareness. TIPS is a general-purpose image-text encoder model, which can be effectively used for dense and global understanding, in vision-only or vision-language tasks.
  • Figure 2: Block diagram of TIPS. From bottom to top: given an input image, we produce masked and cropped augmentations, along with synthetic descriptive captions from a captioner model. They are fed into the text and image encoders, along with the noisy web caption, and the output tokens are used in the losses. The contrastive loss makes use of the two captions, aligning them with two [CLS] tokens obtained from the image encoder. TIPS also employs self-distillation applied to the local crops and a masked image modeling loss applied to dense patch tokens, which encourage spatially-aware and discriminative image representations.
  • Figure 3: Example image (https://commons.wikimedia.org/wiki/File:Cadillac_Escalade07.jpg) with its noisy caption (top) and a synthetic caption generated by PaliGemma [Beyer et al., 2024] (bottom).
  • Figure 4: Qualitative dense prediction results. For a given image (first column), we illustrate the principal components of the predicted spatial features (column 2). Depth (column 3) and normals (column 4) are trained on NYUD, and for semantic segmentation (last column) we used the model trained on ADE20k. All dense tasks used the DPT decoder, with a frozen image encoder. More qualitative results can be found in the appendix.
  • Figure 5: More qualitative results for dense prediction tasks. For a given image (first column), we illustrate the principal components of the predicted spatial features (column 2). Depth (column 3) and normals (column 4) are trained on NYUD, and for semantic segmentation (last column) we used the model trained on ADE20k. All dense tasks used the DPT decoder, while keeping the image encoder weights frozen.
  • ...and 3 more figures
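
The "principal components of the predicted spatial features" shown in Figures 4 and 5 are commonly rendered by projecting each patch feature onto its top three principal components and displaying them as RGB. The sketch below shows one way to do this with NumPy. The captions do not specify the exact recipe, so the choice of three components, the per-channel min-max normalization, and the function name `pca_rgb` are assumptions.

```python
import numpy as np


def pca_rgb(patch_feats, grid_hw):
    """Map patch features to an RGB image via their top-3 principal components.

    patch_feats: array of shape (H * W, D), one feature vector per image patch.
    grid_hw: (H, W) layout of the patch grid.
    Returns an array of shape (H, W, 3) with values in [0, 1].
    """
    h, w = grid_hw
    # Center the features so the SVD yields principal directions.
    centered = patch_feats - patch_feats.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T  # (H * W, 3)
    # Min-max normalize each channel independently for display.
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    rgb = (projected - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(h, w, 3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-in for the patch tokens of a frozen image encoder.
    feats = rng.normal(size=(16 * 16, 32))
    image = pca_rgb(feats, (16, 16))
    print(image.shape)
```

The resulting array can be passed directly to an image viewer such as `matplotlib.pyplot.imshow`. Patches with similar features receive similar colors, which is what makes object boundaries visible in this kind of plot.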