Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi

TL;DR

This work challenges the standard patch-based tokenization of Vision Transformers by introducing semantically meaningful tokens: tangible tokens (instance segmentation masks) and intangible tokens (relationships and actions), both extracted with off-the-shelf segmentation and scene-graph models. A Visual Token Encoder is trained with a contrastive objective to produce image embeddings that align with CLIP caption embeddings, and is enhanced by additive attention weights computed from relational and spatial metadata. On COCO, the approach improves over ViTs in text-to-image (+47%) and image-to-text (+44%) retrieval, and it also improves compositional reasoning on the ARO (+18%) and Winoground (+10%) benchmarks, suggesting a better grasp of high-level semantic entities and relations. Overall, the paper points to a promising direction for rethinking visual tokenization to obtain more compositional and grounded visual representations.

Abstract

Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).
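
The abstract states that additive attention weights enter the computation of self-attention scores, but it does not give the exact formula. A minimal sketch of one common way to do this, assuming the weights act as a per-pair bias added to the scaled dot-product logits before the softmax, is shown below. The function name, the toy token vectors, and the bias values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_additive_bias(q, k, v, bias):
    """Single-head self-attention with an additive bias on the logits.

    q, k, v: lists of n token vectors.
    bias: n x n matrix; bias[i][j] is added to the logit for token i
          attending to token j (assumed form, not confirmed by the paper).
    Returns (outputs, attention_weights).
    """
    n, d = len(q), len(q[0])
    outputs, weights = [], []
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) + bias[i][j]
            for j in range(n)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        )
    return outputs, weights


# Toy example: three tokens, with a hypothetical directed relation 0 -> 1.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero_bias = [[0.0] * 3 for _ in range(3)]
pair_bias = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

_, plain = attention_with_additive_bias(tokens, tokens, tokens, zero_bias)
_, biased = attention_with_additive_bias(tokens, tokens, tokens, pair_bias)
print("token 0 -> token 1 weight without bias:", plain[0][1])
print("token 0 -> token 1 weight with bias:   ", biased[0][1])
```

Because the bias is directional, only the row for token 0 changes. This would match the directed relationships described in the paper if the bias were derived from the scene graph, which is an assumption of this sketch.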

Paper Structure

This paper contains 17 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Training with meaningful visual tokens: We present a framework that uses off-the-shelf segmentation and relation extraction models to prepare a set of tangible tokens ($\mathcal{V}$) and intangible tokens ($\mathcal{U}$) for any arbitrary image, along with directional semantic relationships between them. These tokens and image features ($\mathbf{l}$) are then passed as input to our visual token encoder ($f(.)$). We utilize the semantic relation ($\mathcal{E}$) and relative location ($\mathcal{N}$) information of all tokens to compute additive attention weights, ranked by importance. The learned image embeddings ($\mathbf{s}$) are contrastively aligned with the text embeddings ($\mathbf{t}$) of the CLIP text encoder ($g(.)$), which is simultaneously fine-tuned with our model.
  • Figure 2: Processing of text vs image data: In this simple illustration, we demonstrate the notable difference in how text and visual data are processed by humans and transformers. Humans are capable of deciphering larger concepts from images (both tangible and intangible), where each concept has independent semantic meaning.
  • Figure 3: Using off-the-shelf models to extract tokens: We extract image features ($\mathbf{l}$) and mask embeddings ($\mathcal{V}$) from a panoptic segmentation model. Next, we pass pairs of object masks to a relation extractor and collect the highly probable relationships ($\mathcal{E}$). We compute CLIP text embeddings of all relationships ($\mathcal{U}$). This information is distilled into a scene graph representing the image as shown.
  • Figure 4: Learned Representations: In text-to-image and image-to-text retrieval accuracy, our visual token encoders perform best, beating both the CLIP (fine-tuned) and ViT-s/16 baselines. We also show the average diagonal and off-diagonal similarity of the learned representations across training iterations. These plots show that the contrast is strongest for our visual token encoder when additive attention is not used.
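
The captions for Figures 1 and 4 refer to contrastive alignment between image embeddings ($\mathbf{s}$) and text embeddings ($\mathbf{t}$), to diagonal versus off-diagonal similarity, and to retrieval accuracy. The sketch below illustrates how these quantities could be computed on a toy batch. It assumes cosine similarity and a CLIP-style symmetric cross-entropy loss with a temperature of 0.07; neither the loss form nor the temperature is confirmed by this summary, and the embeddings are made-up values rather than results.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity; rows index images, columns index captions."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(s, t)) for t in txts] for s in imgs]


def diagonal_and_off_diagonal_means(sim):
    """Mean similarity of matched pairs vs. mismatched pairs."""
    n = len(sim)
    diag = sum(sim[i][i] for i in range(n)) / n
    off = sum(sim[i][j] for i in range(n) for j in range(n) if i != j)
    return diag, off / (n * (n - 1))


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Assumed CLIP-style loss: cross-entropy over rows and over columns."""
    n = len(sim)

    def row_cross_entropy(m):
        total = 0.0
        for i in range(n):
            logits = [x / temperature for x in m[i]]
            mx = max(logits)
            log_z = mx + math.log(sum(math.exp(x - mx) for x in logits))
            total += log_z - logits[i]  # target is the matched pair (i, i)
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (row_cross_entropy(sim) + row_cross_entropy(transposed))


def top1_retrieval_accuracy(sim):
    """Image-to-text top-1 accuracy; pass the transpose for text-to-image."""
    n = len(sim)
    hits = sum(
        1 for i in range(n) if max(range(n), key=lambda j: sim[i][j]) == i
    )
    return hits / n


# Hypothetical toy batch: caption i is the match for image i.
image_embs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
text_embs = [[0.9, 0.2], [0.2, 0.9], [-0.8, 0.1]]

sim = similarity_matrix(image_embs, text_embs)
diag_mean, off_mean = diagonal_and_off_diagonal_means(sim)
aligned_loss = symmetric_contrastive_loss(sim)
accuracy = top1_retrieval_accuracy(sim)

# Shuffling the captions breaks the pairing, so the loss should go up.
shuffled = similarity_matrix(image_embs, text_embs[1:] + text_embs[:1])
shuffled_loss = symmetric_contrastive_loss(shuffled)
print(diag_mean, off_mean, aligned_loss, shuffled_loss, accuracy)
```

A larger gap between the diagonal and off-diagonal means corresponds to the stronger "contrast" mentioned in the Figure 4 caption, under the assumption that the plotted similarities are cosine similarities of matched and mismatched image-caption pairs.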