V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, Biye Li
TL;DR
V2Flow tackles the challenge of aligning discrete visual tokens with pretrained LLM vocabularies so that autoregressive image generation can run on top of existing LLMs. It casts visual tokenization as a flow-matching problem: a mapping is learned from a standard normal prior to the continuous image distribution, conditioned on a compact, one-dimensional visual token sequence embedded in the LLM vocabulary space. A visual vocabulary resampler compresses image content into tokens expressed as soft categorical distributions over the LLM vocabulary, and a masked autoregressive rectified-flow decoder refines these tokens with a masked transformer encoder-decoder into contextual embeddings that condition the velocity field used in rectified-flow sampling. Experiments on ImageNet and large-scale text–image data show reconstruction competitive with VQ-based methods and strong text-conditioned image generation when integrated with LLMs such as LLaMA2-7B, highlighting V2Flow's potential for unified autoregressive multimodal generation.
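To make "soft categorical distributions over the LLM vocabulary" concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function name `soft_vocab_tokens`, the temperature parameter, the random stand-in for the LLM embedding table, and the choice to form each visual token embedding as a probability-weighted average of vocabulary embeddings are all assumptions made for illustration.

```python
import numpy as np

def soft_vocab_tokens(logits, vocab_embeddings, temperature=1.0):
    """Hypothetical helper: map per-token logits to soft vocabulary tokens.

    logits:           (num_tokens, vocab_size) scores from some image encoder
    vocab_embeddings: (vocab_size, dim) stand-in for an LLM embedding table
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Assumption: a soft token lives in the LLM embedding space as the
    # expectation of vocabulary embeddings under its categorical distribution.
    soft_embeddings = probs @ vocab_embeddings

    # Assumption: a discrete token id can be read off as the most likely entry.
    hard_ids = probs.argmax(axis=-1)
    return probs, soft_embeddings, hard_ids

# Toy sizes only; a real LLM vocabulary has tens of thousands of entries.
rng = np.random.default_rng(0)
num_tokens, vocab_size, dim = 4, 10, 8
logits = rng.normal(size=(num_tokens, vocab_size))
vocab_embeddings = rng.normal(size=(vocab_size, dim))
probs, soft_embeddings, hard_ids = soft_vocab_tokens(logits, vocab_embeddings)
```

Because each row of `probs` sums to one, every soft embedding is a convex combination of rows of the embedding table, which is one simple way a visual token could stay inside the space an LLM already operates in.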
Abstract
We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each token represented as a soft categorical distribution over the LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, which employs a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, allowing flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. Code: https://github.com/zhangguiwei610/V2Flow
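The sampling side of a rectified-flow decoder can be sketched as plain Euler integration of a conditioned velocity field from a standard normal sample at t = 0 to a data sample at t = 1. The code below is an illustrative sketch, not the paper's decoder. `euler_rectified_flow_sample` and `toy_velocity` are hypothetical names, and the toy velocity field replaces the learned, token-conditioned network with the closed-form field for a target distribution concentrated on a single point.

```python
import numpy as np

def euler_rectified_flow_sample(velocity_fn, cond, shape, num_steps=50, rng=None):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)      # start from the standard normal prior
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t takes values 0, dt, ..., 1 - dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

def toy_velocity(x, t, cond):
    # Assumption for illustration: the data distribution is a point mass at
    # `cond`. Along the straight path x_t = (1 - t) * x_0 + t * cond, the
    # velocity cond - x_0 equals (cond - x_t) / (1 - t).
    return (cond - x) / (1.0 - t)

target = np.array([0.5, -1.0, 2.0])
sample = euler_rectified_flow_sample(
    toy_velocity, target, target.shape, num_steps=20,
    rng=np.random.default_rng(0),
)
```

In this toy setting the Euler trajectory follows a straight line and ends at `target` regardless of the starting noise. In a learned model, `velocity_fn` would be a network conditioned on the refined token embeddings, and the number of steps trades speed against reconstruction quality.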
