Communication-Inspired Tokenization for Structured Image Representations

Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro

TL;DR

This work introduces COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. It demonstrates that, while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and for substantially improving compositional generalization and relational reasoning over prior methods.

Abstract

Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
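
The encoding loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration of the control flow only (observe a crop, update a fixed-size message, repeat), not the authors' implementation. The names `random_crop`, `update_message`, and `encode` are hypothetical, uniform random crop placement is an assumption, and the update rule is a toy stand-in for the transformer that COMiT uses.

```python
import random


def random_crop(image, crop_size, rng):
    """Return a crop_size x crop_size window at a random location.

    Uniform random placement is an assumption made for illustration.
    """
    height, width = len(image), len(image[0])
    top = rng.randrange(height - crop_size + 1)
    left = rng.randrange(width - crop_size + 1)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


def update_message(message, crop):
    """Toy stand-in for the learned update: blend every token with the crop mean.

    In COMiT this step is performed by a transformer; the blend below only
    shows that the message is refined in place and keeps its length.
    """
    pixels = [value for row in crop for value in row]
    crop_mean = sum(pixels) / len(pixels)
    return [0.5 * token + 0.5 * crop_mean for token in message]


def encode(image, num_crops, num_tokens, crop_size, seed=0):
    """Build a message of exactly num_tokens entries from num_crops observations."""
    rng = random.Random(seed)
    message = [0.0] * num_tokens  # fixed token budget, never grows
    for _ in range(num_crops):
        crop = random_crop(image, crop_size, rng)
        message = update_message(message, crop)
    return message


if __name__ == "__main__":
    image = [[1.0] * 8 for _ in range(8)]  # constant 8x8 "image"
    print(encode(image, num_crops=3, num_tokens=4, crop_size=4))
```

The point of the sketch is that the message length is fixed before any crop is seen, so each new observation has to be merged into the existing tokens rather than appended to them. Discretization and the flow-matching decoder are omitted here.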

Paper Structure (22 sections, 6 equations, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: The overall training pipeline of COMiT. A sequence of $K$ random crops is extracted from the input image and iteratively embedded into the latent message $m_K$, which is discretized via FSQ (Mentzer et al.). The discretized message is then decoded by the same model using the flow matching objective (Lipman et al.). Additionally, we use REPA (Yu et al., 2024) to speed up training and SREPA to inject more semantic priors into the latent message.
  • Figure 2: The effect of our attentive tokenization pipeline on the tokens' visual grounding. The model trained with attentive tokenization demonstrates much better token--object alignment than the variant that saw only global crops during training. The token sequences for both models were obtained from the 10th layer of COMiT-B, using the same cropping policy that embeds only the global crop.
  • Figure 3: The difference between the adaptive (top) and global+adaptive (bottom) cropping policies. In both cases COMiT aggregates the crops of the input image (the leftmost column) into the latent message and decodes it to obtain the reconstructed image (the rightmost column, 10 NFE with ${\rm CFG}=7.5$). The columns in between show which crops are selected, together with immediate single-step reconstructions (1 NFE with ${\rm CFG}=1.0$). The progressive ambiguity reduction as more crops are integrated into the latent message is particularly visible with the single-step decoding.
  • Figure 4: The way COMiT adds information to the latent message is inherently compositional.
  • Figure 5: (a) and (b): Ablation of the reconstruction fidelity under different sampling hyperparameters. (c): Ablation of the bottleneck size with COMiT-B.
  • ...and 5 more figures
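
The Figure 1 caption states that the latent message is discretized via FSQ (finite scalar quantization). As a rough sketch of that idea, and not the paper's code, the snippet below bounds each channel with `tanh`, rounds it to a small set of integer levels, and packs the per-channel codes into a single token id. The function names `fsq_quantize` and `fsq_index` are hypothetical, the level counts are arbitrary, and only odd level counts are handled to keep the example short. Training-time details such as the straight-through gradient are omitted.

```python
import math


def fsq_quantize(z, levels):
    """Round each channel of z to one of levels[i] integer values.

    Assumes odd level counts, so the codes are symmetric around zero.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half  # squashed into (-half, half)
        codes.append(int(round(bounded)))  # integer in [-half, half]
    return codes


def fsq_index(codes, levels):
    """Pack per-channel codes into one token id (mixed-radix encoding)."""
    index, base = 0, 1
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index += (code + half) * base  # shift code into [0, num_levels - 1]
        base *= num_levels
    return index


if __name__ == "__main__":
    levels = [5, 5, 5]  # implicit codebook of 5 * 5 * 5 = 125 entries
    codes = fsq_quantize([0.0, 10.0, -10.0], levels)
    print(codes, fsq_index(codes, levels))
```

With three channels of five levels each, the implicit codebook has 125 entries and no codebook vectors need to be learned or stored.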