Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi
TL;DR
This work challenges the standard patch-based tokenization of Vision Transformers by introducing semantically meaningful tokens: tangible tokens (instance segmentation masks) and intangible tokens (relationships and actions), both extracted with off-the-shelf segmentation and scene-graph models. A Visual Token Encoder is trained with a contrastive objective to produce image embeddings that align with CLIP caption embeddings, and its self-attention is augmented with additive attention weights that leverage relational and spatial metadata. On COCO, the approach improves over ViTs on text-to-image (+47%) and image-to-text (+44%) retrieval, and it also improves compositional reasoning on ARO (+18%) and Winoground (+10%), suggesting a better grasp of high-level semantic entities and relations. Overall, the paper points to a promising direction for rethinking visual tokenization to obtain more compositional and grounded visual representations that align well with text.
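To make the alignment step concrete, the following is a minimal sketch of a contrastive objective between image and caption embeddings. It assumes a CLIP-style symmetric InfoNCE loss; the function names, the temperature value of 0.07, and the use of in-batch negatives are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_alignment_loss(image_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb:   (B, D) outputs of the vision-side encoder (hypothetical).
    caption_emb: (B, D) caption embeddings from a text encoder such as CLIP.
    Row i of each array is assumed to describe the same image-caption pair.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(caption_emb)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])             # matching pairs lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # caption -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

Under these assumptions, the loss is low when each image embedding is closest to its own caption and rises when pairs are mismatched, which is the behavior a retrieval-oriented objective needs.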
Abstract
Vision transformers have established a precedent of patchifying images into uniformly sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).
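The idea of additive attention weights can be illustrated with a small single-head sketch in which a bias matrix is added to the scaled dot-product scores before the softmax. How the paper actually derives these weights is not specified here, so the `relation_bias` helper, the (subject, relation, object) triple format, and the `strength` value are hypothetical choices made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_bias(num_tokens, triples, strength=2.0):
    """Build a symmetric additive bias from scene-graph triples (hypothetical).

    Each triple holds the token indices of a subject (tangible token),
    a relation (intangible token), and an object (tangible token).
    Linked tokens receive a positive bias; all other pairs stay at zero.
    """
    bias = np.zeros((num_tokens, num_tokens))
    for subj, rel, obj in triples:
        for a, b in ((subj, rel), (rel, obj), (subj, obj)):
            bias[a, b] += strength
            bias[b, a] += strength
    return bias

def additive_bias_attention(tokens, w_q, w_k, w_v, bias):
    """Single-head self-attention with an additive bias on the scores.

    tokens: (N, D) token embeddings; w_q, w_k, w_v: (D, D) projections;
    bias: (N, N) additive weights, added before the softmax.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])      # scaled dot-product scores
    weights = softmax(scores + bias, axis=-1)    # bias shifts attention toward linked tokens
    return weights @ v, weights
```

For example, with tokens ordered as person (0), dog (1), tree (2), and walking (3), the triple `(0, 3, 1)` raises the attention that "person" pays to "walking" and "dog" relative to the unrelated "tree" token. Adding the bias before the softmax keeps each row of attention weights normalized while letting structural metadata steer them.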
