
UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu

TL;DR

UniCode addresses the limitation of text-only codebooks in multimodal LLMs by learning a unified codebook that tokenizes language and vision with a VAE-style visual tokenizer integrated into the LLM. The core method uses an EMA-based alignment $\mathbb{C}' = \lambda \mathbb{C} + (1-\lambda)\mathbb{C}_L$ and an in-context image decompression pretraining objective to reconstruct multi-layer code maps, complemented by a negative log-likelihood loss $\mathcal{L}(\Theta)=-\sum_{j=1}^{L}\log P_{\Theta}(y_j|\mathcal{I}, \hat{y}_{1:j-1})$ over answer tokens. Training proceeds in two stages: Stage I to unify the codebook and Stage II to perform multimodal instruction tuning without adding extra alignment modules. Experiments show competitive performance on VQA, image generation, and reconstruction with substantially fewer parameters and data, and gains when using UniCode+ with larger encoders, indicating a scalable path to practical multimodal I/O and instruction-following capabilities.
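
The two formulas in the TL;DR can be illustrated with a minimal sketch. It assumes the visual codebook $\mathbb{C}$ and the LLM codebook $\mathbb{C}_L$ are same-shaped lists of vectors, and that the model's probability for each ground-truth answer token is already available. The function names `ema_align` and `answer_nll`, the default value of `lam`, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import math


def ema_align(visual_cb, llm_cb, lam=0.99):
    """Return C' = lam * C + (1 - lam) * C_L, computed entry by entry.

    visual_cb and llm_cb are lists of equal-length vectors (assumed to have
    the same shape); lam is the smoothing weight (hypothetical default).
    """
    if len(visual_cb) != len(llm_cb):
        raise ValueError("codebooks must have the same number of entries")
    return [
        [lam * c + (1.0 - lam) * c_l for c, c_l in zip(row_c, row_l)]
        for row_c, row_l in zip(visual_cb, llm_cb)
    ]


def answer_nll(token_probs):
    """Negative log-likelihood over answer tokens.

    token_probs[j] stands for P(y_j | I, y_{1:j-1}), the probability the
    model assigns to the j-th ground-truth answer token (assumed given).
    """
    return -sum(math.log(p) for p in token_probs)


# Toy usage: two codebook entries of dimension 2.
C = [[1.0, 0.0], [0.0, 1.0]]      # visual tokenizer codebook (toy values)
C_L = [[0.0, 0.0], [2.0, 2.0]]    # LLM codebook (toy values)
C_new = ema_align(C, C_L, lam=0.5)
loss = answer_nll([0.5, 0.25])    # equals log(8)
```

With `lam` close to 1 the visual codebook moves only slightly toward the LLM codebook at each update, which is the usual motivation for a moving-average rule.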

Abstract

In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts their ability to generate images and text in a multimodal context. To this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term "image decompression", enabling our model to interpret compressed visual data and generate high-quality images. The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches in order to compress visual signals into a more compact token representation. Despite using significantly fewer parameters and less data during training, UniCode demonstrates promising capabilities in visual reconstruction and generation. It also achieves performance comparable to leading MLLMs across a spectrum of VQA benchmarks.

Paper Structure (16 sections, 6 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Three paradigms of MLLMs: (a) vis enc+text tok incorporates a lightweight module to align visual signals with the LLM, specifically designed for language generation; (b) vis tok+text tok concatenates the text codebook with quantized visual tokens, significantly increasing the computational cost and complexity; (c) unified tok learns a unified codebook to interpret both visual and text modalities without additional modules. We explore the last option by proposing UniCode in this work.
  • Figure 2: Illustration of multiple paradigms to obtain a unified codebook. The dotted line indicates the training loop: (a) frozen LLM codebook, which initializes the codebook from a pretrained LLM and freezes it during training; (b) dual alternative training, which jointly trains both the visual tokenizer and the LLM by alternately updating each one's codebook using the other's parameters; (c) language-driven iterative training, which smoothly updates the visual tokenizer's codebook toward the LLM's via a moving average.
  • Figure 3: Illustration of the procedure for the in-context image decompression task, which accepts the compressed quantized embeddings $\hat{Z}\in \mathbb{R}^{\hat{h}\times \hat{w}}$ as inputs, and then proceeds to transform these embeddings into their flattened codes $\hat{M}\in \mathbb{R}^{\hat{h}\times \hat{w}\times D}$ that are subsequently used for visual decoding.
  • Figure 4: Qualitative examples of text-conditioned image generation on CC3M.
  • Figure 5: Qualitative examples of image reconstruction generated by our proposed UniCode. Their raw images can be seen in the appendix.
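
The relationship between compressed embeddings and depth-$D$ code maps described for Figure 3 can be illustrated with generic residual (stacked) quantization. This is a sketch under stated assumptions, not the paper's tokenizer: it assumes a single codebook shared across all depths and nearest-neighbour assignment, and the helper names `nearest_code`, `stacked_quantize`, `flatten_codes`, and `dequantize` are hypothetical. In this reading, each position stores $D$ code indices, their summed code vectors form the compressed embedding, and "decompression" means recovering the $D$ indices from that embedding.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for k, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best, best_dist = k, dist
    return best


def stacked_quantize(features, codebook, depth):
    """Map each feature vector to `depth` code indices by residual quantization.

    At every depth the nearest code to the current residual is chosen and
    subtracted, so later codes refine what earlier codes left over.
    """
    code_map = []
    for vec in features:
        residual = list(vec)
        codes = []
        for _ in range(depth):
            k = nearest_code(residual, codebook)
            codes.append(k)
            residual = [r - c for r, c in zip(residual, codebook[k])]
        code_map.append(codes)
    return code_map


def flatten_codes(code_map):
    """Flatten a (positions x depth) code map into one token sequence."""
    return [k for codes in code_map for k in codes]


def dequantize(codes, codebook):
    """Sum the selected code vectors to get one compressed embedding."""
    out = [0.0] * len(codebook[0])
    for k in codes:
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Toy usage: 4 codes of dimension 2, two spatial positions, depth D = 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
features = [[1.4, 0.6], [0.1, 0.9]]
code_map = stacked_quantize(features, codebook, depth=2)
tokens = flatten_codes(code_map)
embedding = dequantize(code_map[0], codebook)
```

A deeper stack lets a small codebook represent each position more precisely, at the cost of a token sequence that is $D$ times longer once flattened.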