V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation

Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, Biye Li

TL;DR

V2Flow tackles the challenge of aligning discrete visual tokens with pretrained LLM vocabularies to enable autoregressive image generation on top of LLMs. It introduces a flow-matching tokenizer that learns a mapping from a standard normal prior to the image distribution, conditioned on a compact, one-dimensional visual token sequence aligned with the LLM vocabulary, and pairs it with a masked autoregressive rectified-flow decoder for high-fidelity reconstruction. The visual vocabulary resampler maps image content into soft categorical distributions over the LLM vocabulary, and the decoder refines the tokens through a masked transformer into embeddings that condition the velocity field used in rectified-flow sampling. Experiments on ImageNet and large-scale text–image data show reconstruction quality competitive with VQ-based methods and strong text-conditioned image generation when integrated with LLMs such as LLaMA2-7B, highlighting V2Flow’s potential for unified autoregressive multimodal generation.

Abstract

We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM's vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each token represented as a soft categorical distribution over the LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. Code: https://github.com/zhangguiwei610/V2Flow
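To make the flow-matching formulation more concrete, the following is a minimal, dependency-free sketch of rectified flow: a straight-line interpolation between a noise sample and a data sample, the constant velocity target along that line, and Euler integration of a velocity field from t = 0 to t = 1. It is an illustration under stated assumptions, not the V2Flow implementation. All function names are hypothetical, and `oracle_velocity` is a closed-form stand-in for the learned, token-conditioned velocity network (here the conditioning signal is simply the target vector itself).

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Rectified-flow regression target: constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def oracle_velocity(x, t, cond):
    """Hypothetical stand-in for a learned velocity network.

    Assumes `cond` is the target vector itself, for which the exact
    velocity on the straight path is (cond - x) / (1 - t).
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]

def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never hits zero
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]   # sample from the normal prior
target = [0.5, -1.0, 2.0, 0.0]                    # toy "image" vector (assumed)
sample = euler_sample(oracle_velocity, noise, target, num_steps=10)
```

In a trained tokenizer, the oracle would be replaced by a network that receives embeddings derived from the visual tokens rather than the target itself, and it would be fit by regressing onto `target_velocity` at randomly sampled `t`.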

Paper Structure

This paper contains 12 sections, 7 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Highlights of the V2Flow tokenizer. In (a), a visual vocabulary resampler compresses visual content into a compact one-dimensional token sequence. Each token is directly expressed within the latent distribution of an existing LLM's vocabulary space, as illustrated in (b). This design facilitates autoregressive visual generation on top of existing LLMs. Subsequently, the quantized visual tokens condition a masked autoregressive Rectified-Flow decoder for high-fidelity visual reconstruction.
  • Figure 2: Overview of the V2Flow tokenizer, including ➊ the Visual Vocabulary Resampler and ➋ the Masked Autoregressive Rectified-Flow decoder. The first component compresses visual content into a compact, one-dimensional discrete token sequence that is directly expressed within the vocabularies of existing LLMs. This enables seamless autoregressive visual generation on top of existing LLMs. The Masked Autoregressive Rectified-Flow decoder then refines the quantized tokens through a masked transformer encoder-decoder, producing visually enriched embeddings. These embeddings condition a tailored velocity field model to reconstruct the underlying visual content. Finally, a rectified-flow sampling strategy with autoregressive prediction offers flexibility in sequence length while preserving competitive reconstruction performance.
  • Figure 3: Pipeline for integrating the V2Flow tokenizer with pretrained LLMs for autoregressive visual generation.
  • Figure 4: Qualitative comparisons of reconstruction quality on the ImageNet-1K test subset, comparing V2Flow against TiTok [yu2024image] at resolution 256×256 and CosMos-Discrete [agarwal2025cosmos] at resolution 512×512. At resolution 256×256, both V2Flow and TiTok [yu2024image] compress the input image into a one-dimensional sequence of 256 tokens, yet V2Flow reconstructs images with finer details. At resolution 512×512, compared to CosMos, which compresses input images into 2D grid latents, V2Flow still achieves superior reconstruction quality.
  • Figure 5: Qualitative results of text-conditioned image generation. We compare our approach against recent state-of-the-art autoregressive models, including Janus-Pro-7B [chen2025janus] and Lumina-mGPT-7B [liu2024lumina]. All images are generated at a resolution of 512×512.
  • ...and 1 more figure
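
The captions for Figures 1 and 2 describe each visual token as a distribution over an existing LLM vocabulary. One way to picture this, as a rough sketch only, is a softmax over vocabulary logits followed by a probability-weighted mix of the LLM's embedding rows. Everything below is an assumption made for illustration: the linear projection, the temperature, the argmax quantization, and the names `soft_vocab_token`, `proj`, and `vocab_embed` do not come from the source.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vocab_token(feature, proj, vocab_embed, temperature=1.0):
    """Express one visual feature as a soft distribution over a vocabulary.

    feature:     list of feat_dim floats (hypothetical visual feature)
    proj:        vocab_size weight vectors of length feat_dim (assumed)
    vocab_embed: vocab_size embedding rows of length embed_dim (assumed frozen)
    """
    # One logit per vocabulary entry: dot product with that entry's weights.
    logits = [sum(f * w for f, w in zip(feature, weights)) for weights in proj]
    probs = softmax(logits, temperature)
    # Discrete token id: the most probable vocabulary entry.
    token_id = max(range(len(probs)), key=probs.__getitem__)
    # Soft embedding: probability-weighted average of the embedding rows.
    embed_dim = len(vocab_embed[0])
    soft_embed = [
        sum(p * row[d] for p, row in zip(probs, vocab_embed))
        for d in range(embed_dim)
    ]
    return probs, token_id, soft_embed

rng = random.Random(0)
feat_dim, vocab_size, embed_dim = 8, 32, 16  # toy sizes, not from the paper
feature = [rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
proj = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)] for _ in range(vocab_size)]
vocab_embed = [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in range(vocab_size)]
probs, token_id, soft_embed = soft_vocab_token(feature, proj, vocab_embed)
```

Because the soft embedding lies in the span of the LLM's own embedding rows, a token built this way can be fed to the LLM in the same space as its text tokens, which is the property the captions emphasize.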