FlexTok: Resampling Images into 1D Token Sequences of Flexible Length

Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, Afshin Dehghan

TL;DR

FlexTok tackles the rigidity of fixed-length image tokenization by introducing a variable-length 1D token sequence that adapts to image complexity. It uses register tokens, a finite scalar quantization (FSQ) bottleneck, and an end-to-end rectified flow decoder, trained with nested dropout and causal masks to produce ordered tokens for autoregressive generation. The approach yields strong ImageNet results with as few as 8 tokens for coarse conditioning and up to 256 tokens for detailed text-conditioned generation, effectively forming a coarse-to-fine visual vocabulary. The framework improves efficiency and scalability over prior 1D/2D tokenizers, and it opens avenues for adaptive compute budgets and for applications beyond static images, such as audio and video.

Abstract

Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256x256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID<2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens to generate depends on the complexity of the generation task.
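
The nested dropout mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the placeholder `None` for dropped tokens, and the power-of-two keep schedule are assumptions made for illustration. The idea shown is only that a random prefix length k is drawn per sample and every register token after position k is masked, so earlier tokens are pushed to carry the coarsest information.

```python
import random

def nested_dropout_mask(num_tokens, keep_choices, rng):
    """Draw a prefix length k and return (k, mask); mask[i] is True for the first k tokens."""
    k = rng.choice(keep_choices)
    mask = [i < k for i in range(num_tokens)]
    return k, mask

def apply_mask(tokens, mask, fill=None):
    """Replace every masked-out token with a placeholder (hypothetical choice: None)."""
    return [tok if keep else fill for tok, keep in zip(tokens, mask)]

rng = random.Random(0)
num_tokens = 256
# Assumed schedule: prefix lengths 1, 2, 4, ..., 256.
keep_choices = [2 ** i for i in range(9)]

tokens = list(range(num_tokens))  # stand-ins for register token ids
k, mask = nested_dropout_mask(num_tokens, keep_choices, rng)
kept = apply_mask(tokens, mask)
```

Because only prefixes are ever kept, a decoder trained this way sees every truncation length it may later be asked to reconstruct from.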

Paper Structure

This paper contains 59 sections, 48 figures, 11 tables.

Figures (48)

  • Figure 1: Comparison of partial sequence generation: Raster-scan 2D-grid tokenizer vs. FlexTok. FlexTok resamples images into a 1D sequence of discrete tokens of flexible length, describing images in a coarse-to-fine manner. When training autoregressive (AR) models on FlexTok token sequences, the class conditioning (here "golden retriever") can be satisfied by generating as few as 8 tokens, whereas AR models trained on 2D tokenizer grids (here, LlamaGen [sun2024autoregressive]) always need to generate all tokens, no matter the complexity of the condition or image.
  • Figure 2: Reconstruction examples using FlexTok d18-d28 trained on DFN. Notice how most of the images' semantic and geometric content is captured by fewer than 16 tokens. The first tokens already capture the high-level semantic concepts (e.g., gray bird, people in colorful garments, mountain scene, yellow flower), while more tokens are required to reconstruct more intricate scene details (e.g., position and clothing of every person, brushstroke placement, etc.). To showcase out-of-distribution reconstruction, we generated the original images using Midjourney v6.1 [midjourneyv61].
  • Figure 3: FlexTok overview. Stage 1: FlexTok resamples 2D VAE latents to a 1D sequence of discrete tokens using a ViT with registers [Darcet2023Registers]. The FSQ-quantized bottleneck representation [mentzer2023fsq] is used to condition a rectified flow model that decodes and reconstructs the original images. FlexTok learns ordered token sequences of flexible length by applying nested dropout [Rippel2014NestedDropout] on the register tokens. Stage 2: We train class- and text-conditional autoregressive Transformers to predict 1D token sequences in a coarse-to-fine manner. As more tokens are predicted, the generated image becomes more specific, encoding high-level concepts first (e.g., presence of a car) followed by finer details (e.g., car shape, brand, color).
  • Figure 4: Image reconstruction comparison between three different TiTok [yu2024titok] models, ALIT [Duggal2024ALIT], and FlexTok. Compared to other 1D tokenizers, FlexTok is able to tokenize images in a highly semantic and ordered manner, all the way down to a single token, and all in a single model. For more visual comparisons, see sec:app_tokens_vs_model_size_viz, sec:app_reconst_samples_viz, and sec:app_reconst_comparison_viz.
  • Figure 5: FlexTok rate-distortion tradeoff. We show ImageNet-1k reconstruction metrics for three different FlexTok sizes. The more tokens used, the closer the reconstructions get to the original RGB images. Scaling the tokenizer size significantly improves reconstruction FID, but is not as crucial in terms of MAE and DreamSim score. For each of the different FlexTok model sizes we use the optimal inference hyperparameters detailed in sec:app_inference_hparam_sweeps. We show additional reconstruction metrics in tab:app_in1k_additional_reconst_metrics.
  • ...and 43 more figures
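
The FSQ-quantized bottleneck named in the Figure 3 caption can be sketched in a few lines. This is a simplified illustration of finite scalar quantization in general, not FlexTok's configuration: the level counts `[5, 5, 5]`, the function names, and the restriction to odd level counts (which avoids the half-step offset needed for even counts) are all assumptions. Each latent channel is bounded with tanh, rounded to the nearest integer level, and the per-channel codes are combined into a single token index by mixed-radix encoding.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each channel of z to an integer code in 0..levels[i]-1 (odd level counts only)."""
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) // 2
        bounded = half * math.tanh(value)   # squashed into [-half, half]
        codes.append(round(bounded) + half)  # nearest level, shifted to start at 0
    return codes

def codes_to_index(codes, levels):
    """Combine per-channel codes into one token id via mixed-radix encoding."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + code
    return index

levels = [5, 5, 5]           # assumed: 3 channels with 5 levels each, 125 possible tokens
z = [0.0, 10.0, -10.0]       # example latent vector for one register token
codes = fsq_quantize(z, levels)
token_id = codes_to_index(codes, levels)
```

The codebook here is implicit: its size is the product of the level counts, and no embedding table has to be learned for the quantization step itself.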