UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, KunPeng Du, Yi Wang, Limin Wang, Yali Wang

TL;DR

UniFlow tackles the long-standing trade-off between visual understanding and high-fidelity generation by combining a layer-wise adaptive self-distillation strategy with a lightweight patch-wise pixel flow decoder. By preserving hierarchical semantic knowledge from a frozen teacher encoder while enabling fine-grained reconstruction through a globally informed pixel flow, UniFlow achieves strong performance on both understanding and generation across 13 benchmarks. The approach yields state-of-the-art results among unified tokenizers and competitive generation quality, and it trains efficiently thanks to patch-level decoding and a compact decoder. The resulting tokenizer is intended for multimodal tasks and downstream vision-language systems.

Abstract

The tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely UniFlow, which flexibly adapts any visual encoder with a concise reconstruction decoder. Specifically, we introduce layer-wise adaptive self-distillation applied to well-pretrained visual encoders, which enables UniFlow to simultaneously inherit strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. Moreover, we propose a lightweight patch-wise pixel flow decoder, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain. By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflicts between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency. Extensive experiments across 13 challenging benchmarks spanning 7 widely studied visual understanding and generation tasks demonstrate that UniFlow achieves a win-win outcome. For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 7.75% on average across understanding benchmarks, but also achieves competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.
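
To make the idea of a conditional flow "from the noisy state back to the patch-wise pixel domain" more concrete, the following is a minimal NumPy sketch of one training step. It is not the authors' implementation. It assumes a linear interpolation path between Gaussian noise and the pixel patches (a common flow-matching choice that the abstract does not confirm), and the names `patchify`, `flow_matching_pair`, `toy_velocity`, and the linear predictor itself are hypothetical stand-ins for the real decoder.

```python
import numpy as np


def patchify(images, patch):
    """Split (B, H, W, C) images into (B, N, patch*patch*C) non-overlapping patches."""
    b, h, w, c = images.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = images.reshape(b, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # -> (B, gh, gw, patch, patch, C)
    return x.reshape(b, gh * gw, patch * patch * c)


def flow_matching_pair(x1, rng):
    """Sample a noisy state x_t and its target velocity for each patch.

    Assumption: linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    so the target velocity is the constant x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=(x1.shape[0], x1.shape[1], 1))  # one timestep per patch
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, t, v_target


def toy_velocity(xt, t, cond, weights):
    """Hypothetical decoder: a single linear map over [x_t, condition, t]."""
    inputs = np.concatenate([xt, cond, t], axis=-1)
    return inputs @ weights


def flow_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.uniform(-1.0, 1.0, size=(2, 8, 8, 3))  # toy batch of images
    patches = patchify(images, patch=4)                 # (2, 4, 48)
    cond = rng.standard_normal((2, 4, 16))              # stand-in semantic features
    weights = 0.01 * rng.standard_normal((48 + 16 + 1, 48))
    xt, t, v_target = flow_matching_pair(patches, rng)
    loss = flow_loss(toy_velocity(xt, t, cond, weights), v_target)
    print("flow-matching loss:", loss)
```

In this sketch each patch is treated as its own small regression target conditioned on a per-patch feature vector, which is one way to read the abstract's claim that patch-wise learning simplifies the data distribution. At inference time the learned velocity would be integrated from noise (t = 0) to pixels (t = 1), for example with a few Euler steps.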

Paper Structure

This paper contains 52 sections, 6 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Comparison of different training paradigms for unified tokenizers. All multimodal large language models are trained on LLaVA-v1.5 data with Vicuna-7B, except that TokenFlow uses Vicuna-13B. UniFlow simultaneously improves performance and training efficiency.
  • Figure 2: The framework of UniFlow. Our UniFlow model is trained end-to-end to endow a powerful VFM with both semantic understanding capabilities and high-fidelity pixel reconstruction.
  • Figure 3: Various downstream tasks demonstrate UniFlow's robust visual representation.
  • Figure 4: Ablation studies on training comparison and hyperparameters.
  • Figure 5: Qualitative analysis of representations. (a) VQA: demonstrates UniFlow's superior understanding of detailed concepts. (b) t-SNE: UniFlow generates more semantically coherent clusters than InternViT and SD-VAE XL. (c) PCA: UniFlow maintains richer spatial information with clearer object contours.
  • ...and 7 more figures