UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding

Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang

TL;DR

UniToken introduces a unified visual encoding that blends discrete and continuous representations to bridge multimodal understanding and image generation within a single model. By employing a dual visual encoder and a shared language-model backbone, the approach achieves state-of-the-art or competitive performance across a wide range of benchmarks, while providing insights into task interference and data distribution. The training regime spans staged pretraining, joint understanding and generation optimization, and high-quality multimodal conversations to enhance instruction-following capabilities. This work offers a practical foundation for future unified multimodal models that flexibly leverage both high-level semantics and low-level visual details.

Abstract

We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks. Unlike previous approaches that rely on unilateral visual representations, our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information that empowers heterogeneous tasks to selectively assimilate domain-specific knowledge based on their inherent characteristics. Through in-depth experiments, we uncover key principles for developing a unified model capable of both visual understanding and image generation. Extensive evaluations across a diverse range of prominent benchmarks demonstrate that UniToken achieves state-of-the-art performance, surpassing existing approaches. These results establish UniToken as a robust foundation for future research in this domain. The code and models are available at https://github.com/SxJyJay/UniToken.

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Different visual encoding paradigms for developing a unified model for visual understanding and image generation. We use orange to denote components related to discrete visual encoding and red to signify components associated with continuous visual encoding. For the sake of brevity, we omit the text tokens in the image.
  • Figure 2: Illustration of (a) the overall framework of UniToken and (b) detailed designs of the unified dual encoder presented in (a). In (a), the "image detokenizer" and "text detokenizer" are responsible for converting predicted token IDs back into images and words, respectively. In (b), the VQ-GAN encoder processes an image and outputs discretized token IDs, which are then transformed into high-dimensional embeddings by indexing the LLM's vocabulary.
  • Figure 3: Question-answering results of UniToken on different types of questions, posed in both English and Chinese. Hallucinations in the responses are highlighted in red.
  • Figure 4: Comparison of image generation results between UniToken and Janus-Pro-7B.
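
The discrete branch described in the Figure 2(b) caption (VQ-GAN token IDs turned into embeddings by indexing the LLM's vocabulary) can be illustrated with a minimal sketch. Everything beyond that lookup is an assumption made for illustration and is not taken from the paper: the function names, the linear projector applied to continuous features, and the choice to place discrete embeddings before continuous ones are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def embed_discrete(token_ids: List[int], vocab_embedding: Matrix) -> Matrix:
    """Discrete branch: look up each VQ token ID in the LLM embedding table."""
    return [list(vocab_embedding[i]) for i in token_ids]


def project_continuous(features: Matrix, projector: Matrix) -> Matrix:
    """Continuous branch (assumed): map features to the LLM width with a
    linear projector of shape (d_in, d_model)."""
    d_model = len(projector[0])
    return [
        [sum(row[k] * projector[k][j] for k in range(len(row))) for j in range(d_model)]
        for row in features
    ]


def unified_visual_sequence(
    token_ids: List[int],
    features: Matrix,
    vocab_embedding: Matrix,
    projector: Matrix,
) -> Matrix:
    """Combine both branches into one visual sequence.
    The ordering (discrete first) is an assumption of this sketch."""
    return embed_discrete(token_ids, vocab_embedding) + project_continuous(features, projector)


# Toy values: a 4-entry embedding table of width 2, three discrete IDs,
# and one 3-dimensional continuous feature vector.
vocab_embedding = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projector = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sequence = unified_visual_sequence([2, 2, 1], [[1.0, 2.0, 3.0]], vocab_embedding, projector)
print(len(sequence), sequence[-1])
```

Identical token IDs map to identical embeddings in the discrete branch, whereas the continuous branch yields a distinct vector per input feature. This is one way to read the contrast the abstract draws between the two kinds of representation.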