From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu

TL;DR

This work introduces a BPE-based image tokenizer that merges VQ-GAN quantized image patches into semantically informative tokens to enable end-to-end token-based multimodal learning. The authors provide theoretical analysis showing that tokenization can bridge the gap between simple unigram models and fully expressive models for 2D image data, and they implement a preliminary training pipeline that integrates the tokenizer with a base LLM (Llama-3.1-8B) through a two-stage process. Empirical results across benchmarks (e.g., VQAv2, MMBench, MME, POPE, VizWiz) demonstrate consistent gains from the tokenizer and show that data scaling can further boost performance, despite training on substantially smaller datasets than CLIP-based encoders. The work suggests a scalable, data-efficient pathway toward more capable multimodal foundation models by aligning visual and textual representations at the token level, with Being-VL-0 serving as a proof of concept. Future work could extend the paradigm to video and larger-scale data while preserving the core benefit of explicit structural tokenization.

Abstract

Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models. This innovative approach enables Transformer models to more effectively learn and reason across modalities. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs' multimodal understanding capabilities, even with limited training data. Leveraging this method, we develop Being-VL-0, a model that demonstrates superior performance across various benchmarks and shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models.

Paper Structure (33 sections, 3 theorems, 33 equations, 3 figures, 10 tables, 3 algorithms)

Key Result

Proposition 1

For data-generating processes described in either the column-wise scenario or the row-wise scenario, as $m\to\infty$, the optimal cross-entropy loss among the unigram model family ${\mathcal{Q}}_{\operatorname{1-gram}}$ satisfies [...]. In contrast, the optimal unconstrained cross-entropy loss satisfies [...].
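The gap that Proposition 1 contrasts can be illustrated with a much simpler one-dimensional analogue. The sketch below is not the paper's 2D setting and does not reproduce its bounds; it assumes a stationary two-state first-order Markov chain with hypothetical switching probabilities `a` and `b`. For such a chain, the best unigram (i.i.d.) model attains a per-token cross-entropy equal to the entropy of the stationary distribution, whereas an unconstrained model can approach the entropy rate of the chain.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def stationary_two_state(a, b):
    """Stationary distribution of a 2-state chain with P(0->1)=a, P(1->0)=b."""
    return [b / (a + b), a / (a + b)]

def unigram_vs_optimal(a, b):
    """Return (best unigram per-token loss, entropy rate) for the chain."""
    pi = stationary_two_state(a, b)
    # The best i.i.d. model matches the marginal, so its loss is H(pi).
    unigram_loss = entropy(pi)
    # The entropy rate averages the per-state transition entropies under pi.
    entropy_rate = pi[0] * entropy([1 - a, a]) + pi[1] * entropy([b, 1 - b])
    return unigram_loss, entropy_rate

# Illustrative values only: a "sticky" chain that rarely switches state.
unigram_loss, entropy_rate = unigram_vs_optimal(0.1, 0.1)
print(f"unigram: {unigram_loss:.4f} nats, entropy rate: {entropy_rate:.4f} nats")
```

In this toy case the unigram model cannot exploit the dependence between neighbouring symbols, so its loss stays strictly above the entropy rate. The paper's argument is that tokenization can narrow a gap of this kind for 2D image data.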

Figures (3)

  • Figure 1: Illustration of our BPE Image Tokenizer. The overall process begins with quantizing the image into initial token IDs. The BPE Image Tokenizer then combines these tokens based on learned patterns, similar to text tokenizers. This combination results in tokens that inherently contain more semantic information. The final tokenized sequence thus incorporates structural prior information from the image, enabling the Transformer model to comprehend the alignment between visual and textual information more deeply during training. This approach facilitates more effective integration of visual data into MLLMs, enhancing their multimodal understanding capabilities.
  • Figure 2: Definition of 2D $k^{th}$-order Markov sequence data, and the performance of a Transformer in learning such sequence data with or without a tokenizer. For details of the hyperparameters used in the experiments, see the implementation section of the paper.
  • Figure 3: (Left) The relationship between model performance and the size of the BPE vocabulary. (Right) Visualization of model weights for token usage under different vocabularies.
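The merging step described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' algorithm: it assumes the image has already been quantized into a grid of integer token IDs, and it merges only horizontally adjacent pairs within each row, which is a simplification of a 2D scheme. The function names (`count_pairs`, `merge_pair`, `train_bpe`) and the toy grid are hypothetical.

```python
from collections import Counter

def count_pairs(rows):
    """Count horizontally adjacent token pairs across all rows."""
    counts = Counter()
    for row in rows:
        counts.update(zip(row, row[1:]))
    return counts

def merge_pair(row, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `row` with `new_id`."""
    out, i = [], 0
    while i < len(row):
        if i + 1 < len(row) and (row[i], row[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(row[i])
            i += 1
    return out

def train_bpe(rows, base_vocab_size, num_merges):
    """Learn up to `num_merges` merge rules; return (merges, encoded rows)."""
    rows = [list(r) for r in rows]
    merges = {}
    next_id = base_vocab_size  # new tokens are numbered after the base codebook
    for _ in range(num_merges):
        counts = count_pairs(rows)
        if not counts:
            break
        # Most frequent pair; ties go to the pair with the smallest IDs.
        pair, freq = max(counts.items(),
                         key=lambda kv: (kv[1], -kv[0][0], -kv[0][1]))
        if freq < 2:
            break  # nothing repeats, so a merge would not compress anything
        merges[pair] = next_id
        rows = [merge_pair(r, pair, next_id) for r in rows]
        next_id += 1
    return merges, rows

# Toy 3x4 grid of quantized token IDs from an assumed codebook of size 8.
grid = [[3, 7, 3, 7],
        [3, 7, 5, 5],
        [5, 5, 3, 7]]
merges, encoded = train_bpe(grid, base_vocab_size=8, num_merges=2)
print(merges)   # {(3, 7): 8, (5, 5): 9}
print(encoded)  # [[8, 8], [8, 9], [9, 8]]
```

Each learned token stands for a recurring local pattern of base tokens, which is the sense in which the merged sequence carries structural prior information. A fuller version would also consider vertically adjacent pairs.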

Theorems & Definitions (9)

  • Definition 1: 2D $k^{th}$-order Markov process
  • Proposition 1
  • proof : Sketch of Proof
  • Remark 1
  • Proposition 2
  • proof : Sketch of Proof
  • proof : Proof of Proposition 1
  • proof : Proof of Proposition 2
  • Lemma A.1: Theorem 4.1 in Rajaraman et al. (2024)