Adaptive Length Image Tokenization via Recurrent Allocation

Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman

TL;DR

This work proposes an approach to learn variable-length token representations for 2D images using an encoder-decoder architecture that recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts.

Abstract

Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence, and even large language models, which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object/part discovery.
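
The rollout described in the abstract can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the authors' implementation: the names `toy_update` and `recurrent_rollout`, the fixed step of 32 new tokens per iteration, the token width of 4, and the averaging update that stands in for the shared encoder are all hypothetical. Only the 32-to-256 token range comes from the abstract.

```python
import random


def toy_update(image_tokens, latents):
    """Stand-in for one pass of a shared encoder (assumed, not from the paper).

    Nudges every latent toward the mean image token and every image token
    toward the mean latent, so both token sets change on each iteration.
    """
    dim = len(image_tokens[0])
    img_mean = [sum(t[d] for t in image_tokens) / len(image_tokens) for d in range(dim)]
    lat_mean = [sum(t[d] for t in latents) / len(latents) for d in range(dim)]
    new_latents = [[0.5 * (x + m) for x, m in zip(t, img_mean)] for t in latents]
    new_image = [[0.9 * x + 0.1 * m for x, m in zip(t, lat_mean)] for t in image_tokens]
    return new_image, new_latents


def recurrent_rollout(image_tokens, step=32, max_tokens=256, seed=0):
    """Grow the 1D latent set by `step` tokens per iteration (step size assumed)."""
    rng = random.Random(seed)
    dim = len(image_tokens[0])
    latents = []
    representations = {}  # token count -> latent tokens at that iteration
    while len(latents) < max_tokens:
        # Add new latent tokens to increase representational capacity.
        fresh = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(step)]
        latents = latents + fresh
        # Refine the 2D tokens and update old and new latents together.
        image_tokens, latents = toy_update(image_tokens, latents)
        representations[len(latents)] = [t[:] for t in latents]
    return representations


# 16 toy "2D image tokens" of width 4, flattened into a list.
image_tokens = [[float(i + d) for d in range(4)] for i in range(16)]
reps = recurrent_rollout(image_tokens)
print(sorted(reps))  # [32, 64, 96, 128, 160, 192, 224, 256]
```

Each entry in `reps` is one candidate representation of the same image, so a downstream consumer can pick whichever length suits it.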

Paper Structure

This paper contains 34 sections, 2 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Adaptive Length Image Tokenization maps an image to multiple variable-length representations through a recurrent token allocation process, enabling task-specific sampling. We learn the tokenizer via image reconstruction as a self-supervised objective. While a compressed representation can be optimized for specific tasks (e.g., fewer tokens for "dog", "leaf", "grass" may suffice for a VLM task), the reconstruction objective supports learning a universal, task-agnostic tokenizer.
  • Figure 2: Adaptive Length Image Tokenizer (ALIT): Given an image, we first convert it into 2D image tokens before applying the 2D $\rightarrow$ 1D latent distillation. ALIT recurrently distills 2D image tokens into variable 1D latent tokens, with each iteration adding new latent tokens and processing them with the existing 2D image tokens and the old latent tokens. Training focuses on reconstructing 2D image tokens through reverse distillation from latent 1D to masked 2D tokens. Based on token-reconstruction quality, we can optionally mask specific 2D tokens from further processing, enabling dynamic halting per token. Recurrent processing with Adaptive Memory leads to compressible representations, flexible tokenization & specialized tokens focusing on objects/parts.
  • Figure 3: Reconstruction Analysis on ImageNet-100: Our approach outperforms all baselines in terms of reconstruction loss (right). Comparing Row-1 (high complexity) and Row-2 (low complexity) demonstrates the effectiveness of adaptive tokenization. Even with fewer tokens, our reconstructions maintain reasonable global alignment with ground truth, with an expected loss in detail.
  • Figure 4: Compression vs. Information Entropy Hypothesis on the Out-of-Distribution People-Art Dataset: Adaptive tokenization enables analysis of the Low-Complexity Art Hypothesis by examining token requirements for images of varying complexity. The plot on the right shows that as (human-annotated) image complexity increases, so does the number of tokens needed. More complex images have higher L1 reconstruction loss at lower token counts.
  • Figure 5: Analysing Dataset Representation Capacity: We vary tokens per image using different Token Selection Criteria (TSC) -- Best Top-X Classification Accuracy (Left) and Depth Error $<$ Threshold (Right). We use GT class/depth maps for computing TSC classification/depth errors. We then evaluate Classification Accuracy (Task of Interest, TOI) on the dataset reconstructed using different TSCs. X-axis = TSC, Y-axis = Dataset Token Count, Marker-Color = TOI Perf.
  • ...and 13 more figures
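
The captions for Figures 4 and 5 describe choosing a per-image token count from an error criterion. A minimal sketch of one such threshold rule follows, assuming a per-image table of losses at each available token count. The function name `select_token_count`, the loss values, and the threshold are hypothetical and are not taken from the paper.

```python
def select_token_count(loss_by_count, threshold):
    """Return the smallest token count whose loss is below `threshold`.

    Falls back to the largest available count when no count qualifies.
    """
    for count in sorted(loss_by_count):
        if loss_by_count[count] < threshold:
            return count
    return max(loss_by_count)


# Hypothetical L1 reconstruction losses at each token count.
simple_image = {32: 0.04, 64: 0.03, 128: 0.02, 256: 0.02}
complex_image = {32: 0.20, 64: 0.12, 128: 0.07, 256: 0.04}

chosen = [select_token_count(img, 0.08) for img in (simple_image, complex_image)]
print(chosen)  # [32, 128]
```

With these made-up numbers the simpler image stops at 32 tokens while the more complex one needs 128, which mirrors the trend the Figure 4 caption reports.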