Vision Transformers with Natural Language Semantics

Young Kyung Kim; J. Matías Di Martino; Guillermo Sapiro

TL;DR

This paper addresses the lack of semantic content in standard Vision Transformer (ViT) tokens by introducing the Semantic Vision Transformer (sViT), which tokenizes images into semantic segments produced by a segmentation model (SAM). By encoding segment coordinates and sizes in the positional embeddings and adding a background token, sViT gains a natural inductive bias while preserving global context, which improves data efficiency and robustness, especially on non-object-centric datasets and under distribution shifts. The authors also introduce segment-level data augmentation and a gradient-based, segment-focused interpretability approach, and report better interpretability and gains of up to ~29% on certain datasets compared with standard ViT, at the price of extra computation for segmentation. Overall, sViT is a principled step toward more interpretable and robust vision transformers, with practical implications for data-efficient training and out-of-distribution generalization.

Abstract

Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific semantic context, which makes them difficult to interpret and poor at encapsulating information. We introduce a novel transformer model, Semantic Vision Transformers (sViT), which leverages recent progress on segmentation models to design novel tokenizer strategies. sViT effectively harnesses semantic information, creating an inductive bias reminiscent of convolutional neural networks while capturing the global dependencies and contextual information within images that are characteristic of transformers. Validated on real datasets, sViT outperforms ViT, requiring less training data while maintaining similar or superior performance. Furthermore, sViT is significantly better at out-of-distribution generalization and more robust to natural distribution shifts, which is attributed to the scale invariance of its semantic tokens. Notably, the use of semantic tokens significantly enhances the model's interpretability. Lastly, the proposed paradigm facilitates the introduction of new and powerful augmentation techniques at the token (or segment) level, increasing training data diversity and generalization capabilities. Just as sentences are made of words, images are formed by semantic objects; our proposed methodology leverages recent progress in object segmentation and takes an important and natural step toward interpretable and robust vision transformers.
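The tokenization idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the segment masks have already been produced by some segmentation model, and the function names, the nearest-neighbour resize, the five-number positional feature (normalized bounding box plus relative area), and the way the background token is built are all hypothetical choices made for illustration.

```python
import numpy as np

def bbox_of(mask):
    """Bounding box (top, left, bottom, right) of a non-empty boolean mask.
    Bottom and right are exclusive."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def resize_nearest(patch, size):
    """Nearest-neighbour resize of an (h, w, c) array to (size, size, c)."""
    h, w = patch.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return patch[ys][:, xs]

def semantic_tokens(image, masks, size=16):
    """Build one token per segment plus one background token.

    image: float array of shape (H, W, C)
    masks: sequence of non-empty boolean (H, W) segment masks (assumed given)
    Returns (tokens, positions): tokens has shape (n + 1, size * size * C),
    positions has shape (n + 1, 5).
    """
    masks = np.asarray(masks, dtype=bool)
    H, W = image.shape[:2]
    tokens, positions = [], []
    for mask in masks:
        top, left, bottom, right = bbox_of(mask)
        # Assumption: pixels outside the segment are zeroed inside its box.
        crop = image[top:bottom, left:right] * mask[top:bottom, left:right, None]
        tokens.append(resize_nearest(crop, size).reshape(-1))
        # Positional feature: normalized box corners and relative segment area.
        positions.append([left / W, top / H, right / W, bottom / H,
                          mask.sum() / (H * W)])
    # Assumption: the background token is everything no segment covers.
    background = ~masks.any(axis=0)
    tokens.append(resize_nearest(image * background[..., None], size).reshape(-1))
    positions.append([0.0, 0.0, 1.0, 1.0, background.sum() / (H * W)])
    return np.stack(tokens), np.array(positions)
```

In a full model, the flattened tokens and the positional features would each pass through a learned projection before entering the transformer encoder; those layers are omitted here.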

Paper Structure (15 sections, 6 figures, 2 tables, 1 algorithm)

Introduction
Methods
sViT
Data Augmentation
Enhancing Model Interpretability
Experiments
Setup
Datasets
Training and Finetuning
Results
Interpretability
Robustness on natural distribution shift
Related Works
Limitation
Conclusion

Figures (6)

Figure 1: Example of non-semantic and semantic tokenization for NLP and CV. Tokens from non-semantic tokenization are hard to explain and understand, while it is easier to explain and understand tokens from semantic tokenization for both NLP and CV.
Figure 2: The example illustrates the differences between non-semantic and semantic tokenization on non-object-centric and object-centric images. As shown in the left column under non-semantic tokenization, patches containing women differ between non-object-centric and object-centric images, whereas they are similar or identical when semantic tokenization is applied. This consistency is achievable because semantic tokenization segments based on object presence rather than uniform patch size.
Figure 3: Architectures of ViT and sViT. ViT first resizes an image to a fixed size and divides it into equal-sized patches, whereas sViT divides an image at the semantic-segment level and resizes the segments into equal-sized patches. ViT incorporates positional embedding mapped from the order of patches, while sViT includes positional embedding mapped from the coordinates of the semantic bounding box and the sizes of the segments. Outside of these two fundamental differences, the transformer architecture remains the same.
Figure 4: Example of augmented images at the semantic segment level and the entire image level. Top-row images are data augmentation results using horizontal flip, cropping, and resizing techniques on the entire image. Bottom-row images are data augmentation results using the same techniques but now at the semantic segment level.
Figure 5: The first column shows the images used to evaluate the interpretability of ViT and sViT. Columns two to four show the interpretability results from applying Grad-CAM, Grad-CAM++, and HiRes-CAM to ViT, respectively. The fifth column shows the interpretability result for our method. The first three rows are from object-centric datasets and the last three rows are from non-object-centric datasets. The color coding indicates the level of importance, which increases progressively through blue, green, yellow, and orange before peaking at red. Only our proposed method consistently provides human-interpretable and semantic results.
...and 1 more figure
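The segment-level augmentation described in the Figure 4 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: it assumes each segment has already been cropped to its own (h, w, c) array, and the function names and the `p_flip` and `min_frac` parameters are hypothetical. Only the flip and crop steps are shown; resizing is left to whatever fixed-size resampling the tokenizer applies afterwards.

```python
import numpy as np

def random_crop_segment(crop, rng, min_frac=0.8):
    """Randomly crop a segment to between min_frac and 100% of each side."""
    h, w = crop.shape[:2]
    ch = max(1, int(round(h * rng.uniform(min_frac, 1.0))))
    cw = max(1, int(round(w * rng.uniform(min_frac, 1.0))))
    top = rng.integers(0, h - ch + 1)    # upper bound is exclusive
    left = rng.integers(0, w - cw + 1)
    return crop[top:top + ch, left:left + cw]

def augment_segments(crops, rng, p_flip=0.5, min_frac=0.8):
    """Augment every segment independently of the others.

    crops: list of (h, w, c) arrays, one per semantic segment
    rng:   a numpy.random.Generator
    """
    augmented = []
    for crop in crops:
        if rng.random() < p_flip:
            crop = crop[:, ::-1]         # horizontal flip of this segment only
        augmented.append(random_crop_segment(crop, rng, min_frac))
    return augmented
```

Because each segment draws its own random flip and crop, a single image can yield many more distinct token sets than whole-image augmentation, which applies one transform to everything at once.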
