TULIP: Towards Unified Language-Image Pretraining

Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan

TL;DR

TULIP introduces a unified, open-source pretraining framework that bridges vision-centric and language-grounded representations by combining patch-level contrastive views, diffusion-based generative augmentation, and reconstruction regularization. It extends the SigLIP framework with image-text, image-image, and text-text contrastive losses, plus a MAE/T5-based reconstruction objective, and employs GeCo to generate diverse positive and hard negative views via language and image editors. Trained on DataComp-1B with Recap-DataComp-1B and additional multi-view data, TULIP scales to over 1B parameters and achieves state-of-the-art zero-shot ImageNet-1K performance and strong gains on the RxRx1 and MMVP benchmarks, while remaining a drop-in replacement for existing contrastive image-text (CIT) models. The work shows that enriching contrastive views with generative augmentation and reconstruction can yield robust, fine-grained visual representations without sacrificing semantic alignment, improving performance across vision, language, and multimodal tasks. It also provides extensive evaluations and ablations, and releases code and checkpoints to support further research in unified vision-language pretraining.

Abstract

Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. Because they are trained for language alignment, these models tend to prioritize high-level semantics over fine-grained visual detail. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io

Paper Structure

This paper contains 45 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: TULIP Overview. Existing contrastive image-text models struggle with high-fidelity visual understanding. TULIP is a drop-in replacement for CLIP which leverages generative data augmentation, global-local patch-wise image contrastive learning, and reconstruction-based feature regularization to learn robust visual features and fine-grained language grounding.
  • Figure 2: TULIP Image Encoder. Images undergo both traditional augmentations (such as cropping and color jittering) and generative augmentations via GeCo, which leverages large generative models to create semantically consistent or semantically altered views. These views are then used for image-image and image-text contrastive learning. Additionally, a masked autoencoder (MAE)-based reconstruction loss is applied to encourage the model to encode both semantic and fine-grained details.
  • Figure 3: TULIP Text Encoder. Text undergoes generative augmentation through paraphrasing and controlled semantic alterations using large language models, generating both positive and negative contrastive pairs. These pairs are used for both text-text and image-text contrastive learning with a SigLIP objective. Similar to image reconstruction, a causal decoder (based on T5) is used for text reconstruction, ensuring that the model retains both high-level semantics and fine-grained linguistic detail.
  • Figure 4: Overview of GeCo. Our generative augmentation framework leverages large generative models to create diverse contrastive views by generating both positive and negative augmentations for images and text. For text augmentation, we use Llama-3.1-8B-Instruct to generate paraphrases and semantically altered text variations. For image augmentation, we fine-tune an instruction-based image editing model (e.g., InstructPix2Pix) using soft-prompting to generate semantically consistent (positive) and semantically altered (negative) views.
  • Figure 5: (Top) GeCo generates positive and negative augmentations of both images and text. (Bottom) TULIP uses these augmentations during training with corresponding weights (+1 for a positive pair, -1 for a negative pair, 0 to ignore).
  • ...and 3 more figures
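
The pair weighting described in the Figure 5 caption (+1 for a positive pair, -1 for a negative pair, 0 to ignore) can be illustrated with a SigLIP-style pairwise sigmoid loss. The following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `masked_sigmoid_loss`, the fixed `temperature` and `bias` values, and the choice to average over non-ignored pairs are all hypothetical and do not come from the paper.

```python
import numpy as np


def masked_sigmoid_loss(emb_a, emb_b, labels, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss with labels in {+1, -1, 0} (illustrative only).

    labels[i, j] = +1 treats (a_i, b_j) as a positive pair, -1 as a negative
    pair, and 0 drops the pair from the loss entirely.
    temperature and bias are fixed here for simplicity (an assumption).
    """
    # L2-normalize so the dot product is a cosine similarity.
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = temperature * (emb_a @ emb_b.T) + bias

    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)

    # Pairs labeled 0 contribute nothing; average over the remaining pairs
    # (this normalization is an assumption of the sketch).
    mask = labels != 0
    return float((per_pair * mask).sum() / max(int(mask.sum()), 1))


# Usage: matched pairs on the diagonal are positives, all others negatives,
# and one pair is explicitly ignored.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))
pair_labels = -np.ones((4, 4))
np.fill_diagonal(pair_labels, 1.0)
pair_labels[0, 1] = 0.0  # exclude this pair from the loss
example_loss = masked_sigmoid_loss(image_emb, text_emb, pair_labels)
```

The same function applies to image-image or text-text pairs by passing two sets of embeddings from the same modality. Generated negatives would simply receive a -1 label against their source.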