Differentiable Hierarchical Visual Tokenization

Marius Aasan, Martine Hjelkrem-Tan, Nico Catalano, Changkyu Choi, Adín Ramírez Rivera

TL;DR

This work introduces an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models.

Abstract

Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.

Paper Structure

This paper contains 28 sections, 3 theorems, 17 equations, 14 figures, 12 tables, 2 algorithms.

Key Result

Theorem A.10

Let $V$ be a set, and let ${\sim}$ be an equivalence relation on $V$. Then $V / {\sim} \in \Pi(V)$.
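The statement can be checked on a small finite example. The sketch below is illustrative only and is not taken from the paper: the helper names `quotient_set` and `is_partition` are hypothetical, and the relation used (congruence modulo 3 on the integers 0 to 9) is chosen purely for demonstration. It builds $V / {\sim}$ and verifies the three partition axioms: blocks are nonempty, pairwise disjoint, and cover $V$.

```python
from itertools import combinations


def quotient_set(V, related):
    """Return V/~ as a set of frozensets.

    `related(a, b)` is assumed (not checked) to be an equivalence relation on V.
    """
    classes = []
    for v in V:
        for cls in classes:
            # Comparing against one representative suffices because ~ is
            # symmetric and transitive.
            if related(v, next(iter(cls))):
                cls.add(v)
                break
        else:
            classes.append({v})
    return {frozenset(c) for c in classes}


def is_partition(blocks, V):
    """Check that blocks are nonempty, pairwise disjoint, and cover V."""
    nonempty = all(len(b) > 0 for b in blocks)
    disjoint = all(a.isdisjoint(b) for a, b in combinations(blocks, 2))
    covers = set().union(*blocks) == set(V)
    return nonempty and disjoint and covers


V = range(10)
same_residue = lambda a, b: a % 3 == b % 3  # congruence modulo 3
Q = quotient_set(V, same_residue)
print(sorted(sorted(block) for block in Q))
print(is_partition(Q, V))
```

Replacing `same_residue` with a predicate that is not transitive would break the representative shortcut in `quotient_set`, which is exactly where the equivalence-relation hypothesis of the theorem is used.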

Figures (14)

  • Figure 1: Comparing spatial granularity in visual tokenizers. dHT (right) provides an end-to-end learnable framework for multi-scale tokenization. We provide more examples in \ref{fig:spatgran}.
  • Figure 2: Illustration of the dHT tokenization and feature extraction pipeline. From an input image we produce a hierarchy of superpixel representations. An optimal segmentation is then selected from the hierarchy using information criteria, and features are extracted for each superpixel. Features can then be used in any ViT backbone. (Right) We also depict the feature extraction process of a superpixel $S$ where its features are mixed based on foreground, $M^+$, background, $M^-$, and shared background features, $\beta$. Details in \ref{sec:sp_diff}.
  • Figure 3: Single-scale semantic segmentation mIoU results on ADE20k \cite{ade20k} and COCO-Stuff164k \cite{coco}.
  • Figure 4: Comparison of image vectorization with DiffVG \cite{diffvg} (left) and dHT (right).
  • Figure 5: Image vectorization from dHT token extraction with zooms to show the finer details, comparing the original image (left) and vectorized image (right).
  • ...and 9 more figures
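
The selection step described in the Figure 2 caption, choosing one segmentation from a hierarchy using an information criterion, can be illustrated with a toy example. The sketch below is not the paper's method: the criterion (BIC), the piecewise-constant Gaussian model, the known noise variance `noise_var`, and the function names `bic` and `select_level` are all assumptions made for illustration. It also selects a single global level on a 1D signal, whereas the paper works with superpixel hierarchies on images.

```python
import math


def bic(signal, labels, noise_var):
    """BIC (up to an additive constant) of a piecewise-constant Gaussian fit.

    Assumes one free mean per segment and a known noise variance `noise_var`.
    """
    n = len(signal)
    groups = {}
    for x, lab in zip(signal, labels):
        groups.setdefault(lab, []).append(x)
    rss = 0.0
    for xs in groups.values():
        mu = sum(xs) / len(xs)
        rss += sum((x - mu) ** 2 for x in xs)
    k = len(groups)  # number of free parameters (segment means)
    return rss / noise_var + k * math.log(n)


def select_level(signal, hierarchy, noise_var):
    """Return (index of the lowest-BIC partition, list of all scores)."""
    scores = [bic(signal, labels, noise_var) for labels in hierarchy]
    best = min(range(len(hierarchy)), key=scores.__getitem__)
    return best, scores


# Toy 1D "image": two flat regions with small noise.
signal = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]

# Nested partitions from fine to coarse; each level merges segments of the previous one.
hierarchy = [
    [0, 1, 2, 3, 4, 5, 6, 7],  # every sample is its own segment
    [0, 0, 1, 1, 2, 2, 3, 3],  # pairs
    [0, 0, 0, 0, 1, 1, 1, 1],  # two regions
    [0, 0, 0, 0, 0, 0, 0, 0],  # a single segment
]

best, scores = select_level(signal, hierarchy, noise_var=0.01)
print(best, [round(s, 2) for s in scores])
```

The penalty term `k * log(n)` is what keeps the finest partition from winning: it fits the data exactly, but pays for eight parameters, while the two-region partition explains the signal almost as well with two.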

Theorems & Definitions (14)

  • Definition A.1: Neighborhood
  • Definition A.2: Subgraphs
  • Definition A.3: Graph Connectivity
  • Definition A.4: Reachability
  • Definition A.5: Equivalence Relations and Classes
  • Definition A.6: Quotient Set
  • Definition A.7: Partition of Sets
  • Definition A.8: Refinement of Partitions
  • Definition A.9: Hierarchy of Partitions
  • Theorem A.10: Fundamental Theorem on Equivalence Relations
  • ...and 4 more