
CLIPPO: Image-and-Language Understanding from Pixels Only

Michael Tschannen, Basil Mustafa, Neil Houlsby

TL;DR

CLIPPO presents a unified, tokenizer-free Vision Transformer that processes both images and text rendered as images, trained with a single contrastive objective. Despite lacking modality-specific towers, CLIPPO achieves competitive image classification and retrieval performance relative to CLIP-style baselines and demonstrates meaningful language understanding and multilingual retrieval when augmented with text/text contrastive training. The study shows that co-training with sentence pairs improves language tasks (GLUE) but can slightly dampen some vision-language retrieval metrics, highlighting a trade-off in multi-objective optimization. Overall, the pixel-only approach simplifies data pipelines and tokenizer dependencies, enabling robust cross-modal capabilities and suggesting promising extensions to additional modalities and multilingual settings.

Abstract

Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP-style models, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications.
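The training setup described in the abstract (one shared encoder, two forward passes, a single contrastive loss) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: `shared_encoder` is a hypothetical linear stand-in for the Vision Transformer, the "rendered text" is random pixel data rather than actual rendered alt-text, and the fixed temperature of 0.07 is a placeholder, not a value taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def shared_encoder(pixels, weights):
    """Hypothetical stand-in for the single encoder: flatten pixels, project, normalize."""
    flat = pixels.reshape(pixels.shape[0], -1)
    return l2_normalize(flat @ weights)

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric contrastive loss; row i of z_a is the positive for row i of z_b."""
    logits = z_a @ z_b.T / temperature
    idx = np.arange(logits.shape[0])
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

rng = np.random.default_rng(0)
batch, side, dim = 4, 8, 16
images = rng.normal(size=(batch, side, side))         # toy "regular images"
rendered_text = rng.normal(size=(batch, side, side))  # toy "text rendered as images"
weights = rng.normal(size=(side * side, dim))         # one set of weights for both inputs

# Two separate forward passes through the same encoder.
z_img = shared_encoder(images, weights)
z_txt = shared_encoder(rendered_text, weights)
loss = contrastive_loss(z_img, z_txt)
```

The point of the sketch is that `weights` is used for both inputs, so there is no text-specific tower or embedding table. Under the same assumptions, the text/text objective mentioned in the abstract would reuse `contrastive_loss` with two batches of rendered sentences in place of the image/alt-text pair.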

Paper Structure (50 sections, 13 figures, 11 tables)

This paper contains 50 sections, 13 figures, 11 tables.

Figures (13)

  • Figure 1: CLIP (Radford et al., 2021) trains separate image and text encoders, each with modality-specific preprocessing and embedding, on image/alt-text pairs with a contrastive objective. CLIPPO trains a pure pixel-based model with equivalent capabilities by rendering the alt-text as an image, encoding the resulting image pair using a shared vision encoder (in two separate forward passes), and applying the same training objective as CLIP.
  • Figure 2: Results on the VQAv2 benchmark (test-dev set). In addition to CLIPPO and the baselines produced in this work, we also compare to Pythia and MCAN models with ViT encoders from [clip_vision_and_language_tasks_2022], and to comparably sized METER [meter_2022] and ViLT [kim2021vilt] models. CLIPPO clearly outperforms CLIP$^*$ and 1T-CLIP on "yes/no" questions and performs similarly to task-specific models.
  • Figure 3: Tokenization efficiency analyzed in terms of the sequence length produced by a given method. CLIPPO produces shorter sequences for the majority of languages compared to 1T-CLIP with alternative tokenizers.
  • Figure 4: Zero-shot image/text retrieval performance on CrossModal3600 [crossmodal3600]. Although specialized (mc4) tokenizers can be leveraged to improve multilingual performance, CLIPPO (dashed black line) broadly matches or exceeds comparable 1T-CLIP models trained with vocabulary size 32,000 (the word embeddings result in a 27% increase in parameter count compared to CLIPPO).
  • Figure 5: Visualization of the modality gap for CLIP$^*$ and CLIPPO, optionally trained with 25% C4 data. The visualization follows the analysis from [modality_gap_2022] and shows embedded images (blue dots) and corresponding alt-text (orange dots) from the WebLI validation set, projected onto the first two principal components of the validation data matrix. CLIPPO has a slightly smaller modality gap than CLIP$^*$; co-training with C4 data strongly reduces the gap.
  • ...and 8 more figures
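
The projection described in the Figure 5 caption (image and alt-text embeddings projected onto the first two principal components) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's analysis code: the embeddings are synthetic Gaussian data with an artificial offset between the two groups, and `centroid_gap` (the distance between the mean image embedding and the mean text embedding) is one simple way to summarize a modality gap, chosen here for illustration.

```python
import numpy as np

def pca_project(embeddings, num_components=2):
    """Project rows onto the top principal components of the centered data matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_components].T

def centroid_gap(z_img, z_txt):
    """Hypothetical gap summary: Euclidean distance between the two group centroids."""
    return float(np.linalg.norm(z_img.mean(axis=0) - z_txt.mean(axis=0)))

rng = np.random.default_rng(0)
n, dim = 200, 32
offset = np.zeros(dim)
offset[0] = 3.0  # artificial shift standing in for a modality gap

z_img = rng.normal(size=(n, dim))           # synthetic "image" embeddings
z_txt = rng.normal(size=(n, dim)) + offset  # synthetic "alt-text" embeddings

# Fit the principal components on both groups together, then split again.
projected = pca_project(np.concatenate([z_img, z_txt]))
img_2d, txt_2d = projected[:n], projected[n:]

gap_full = centroid_gap(z_img, z_txt)   # gap in the original space
gap_2d = centroid_gap(img_2d, txt_2d)   # gap visible in the 2-D plot
```

Plotting `img_2d` and `txt_2d` as two scatter series would give a picture of the same kind as Figure 5. Because the projection is onto orthonormal directions, the gap seen in two dimensions can never exceed the gap in the full embedding space.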