Multi-modal Auto-regressive Modeling via Visual Words

Tianshuo Peng, Zuchao Li, Lefei Zhang, Hai Zhao, Ping Wang, Bo Du

TL;DR

This paper proposes the concept of visual tokens, which maps visual features to probability distributions over the LLM's vocabulary and thereby provides supervision information for visual modelling. It also explores the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.

Abstract

Large Language Models (LLMs), benefiting from auto-regressive modelling performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities. However, extending auto-regressive modelling to multi-modal scenarios to build Large Multi-modal Models (LMMs) faces a major difficulty: image information is processed in the LMM as continuous visual embeddings, for which discrete supervised labels for classification cannot be obtained. In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time. Specifically, we propose the concept of visual tokens, which maps the visual features to probability distributions over the LLM's vocabulary, providing supervision information for visual modelling. We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.
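
The mapping described in the abstract can be illustrated with a small, self-contained sketch. It is a toy under stated assumptions, not the authors' implementation: the head is assumed to be a single linear projection followed by a softmax, and the soft-label cross-entropy is only one plausible reading of a "unified classification objective". The four-word vocabulary, the weights `HEAD`, and the helper names `visual_token` and `soft_cross_entropy` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_token(feature, head_weights):
    """Map one continuous visual feature to a distribution over the vocabulary.

    Assumption: the head is one linear layer (one weight row per vocabulary
    entry) followed by a softmax. The abstract does not specify this.
    """
    logits = [sum(w * f for w, f in zip(row, feature)) for row in head_weights]
    return softmax(logits)


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Hypothetical toy vocabulary and head weights, for illustration only.
VOCAB = ["cat", "dog", "grass", "sky"]
HEAD = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, -1.0],
]

feature = [2.0, 0.5, 0.1]           # one image-patch feature (made up)
dist = visual_token(feature, HEAD)  # the "visual token" for that patch
top_word = VOCAB[max(range(len(VOCAB)), key=dist.__getitem__)]

# A model that predicts the next visual token could be scored against `dist`
# with the same kind of classification loss used for text tokens.
uniform = [1.0 / len(VOCAB)] * len(VOCAB)
loss_uniform = soft_cross_entropy(dist, uniform)
loss_perfect = soft_cross_entropy(dist, dist)
```

In this sketch the distribution itself acts as the supervision label, so a prediction that matches it gets a lower loss than an uninformative uniform prediction.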

Paper Structure (23 sections, 9 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Performance of our proposed VT-LMM. We set the unreported results in the original paper to 0 to avoid confusion.
  • Figure 2: Comparison of different LMMs. (a) The visual context method focuses on predicting language responses in multi-modal contextual sequences that depend on visual information; the visual information merely acts as a contextual cue and does not serve as supervision. (b) The visual regression method uses a regression task to predict the value of the next visual feature and is trained jointly with text. (c) Our multi-modal auto-regressive method uses visual tokens to construct visual supervision labels, enabling multi-modal auto-regressive modelling with a unified classification objective.
  • Figure 3: Overview of our method. (a) The overall framework of the model. VT-LMM uses the VM head to transform visual features in multi-modal input sequences into probability distributions over the LLM's vocabulary (so-called visual tokens), which participate in visual modelling. (b) Constructing pseudo image features from the pre-trained embeddings of the LLM and the visual tokens. (c) Demonstration of the semantically closest tokens for each image patch in the LMM.
  • Figure 4: Image regions and the highest-probability tokens in their visual tokens. Best viewed zoomed in.
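
Panels (b) and (c) of Figure 3 can be sketched in a few lines. This is a hedged illustration, not the paper's code: it assumes a pseudo image feature is the probability-weighted sum of the LLM's token embeddings, and it assumes cosine similarity as the measure of "semantically closest". The toy vocabulary, the embedding table `EMB`, and the helper names are hypothetical.

```python
import math


def pseudo_feature(visual_token, embeddings):
    """Probability-weighted sum of token embeddings (assumed construction)."""
    dim = len(embeddings[0])
    return [
        sum(p * emb[j] for p, emb in zip(visual_token, embeddings))
        for j in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_token(feature, embeddings, vocab):
    """Return the vocabulary entry whose embedding is most similar to `feature`."""
    scores = [cosine(feature, emb) for emb in embeddings]
    return vocab[max(range(len(vocab)), key=scores.__getitem__)]


# Hypothetical toy vocabulary and embedding table, for illustration only.
VOCAB = ["cat", "dog", "grass"]
EMB = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

patch_token = [0.7, 0.2, 0.1]            # visual token of one image patch (made up)
feat = pseudo_feature(patch_token, EMB)  # lives in the text embedding space
nearest = closest_token(feat, EMB, VOCAB)
```

Because the pseudo feature is built only from text embeddings, it can be compared directly with any token embedding, which is what makes a nearest-token lookup per patch possible in this toy setting.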