AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai

TL;DR

AlignGPT is a new multimodal large language model that divides image-text pairs into groups according to their degree of alignment, then adaptively combines the learned alignment-level representations to meet the differing alignment needs of each task.

Abstract

Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their capability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: a pre-training phase and an instruction-tuning phase. Despite their success, these models have shortcomings in how they model alignment capabilities. First, during the pre-training phase, the model usually assumes that all image-text pairs are uniformly aligned, but in fact the degree of alignment varies from one image-text pair to another. Second, the instructions currently used for fine-tuning cover a variety of tasks, and different tasks usually require different levels of alignment capability, but previous MLLMs overlook these differentiated alignment needs. To tackle these issues, we propose a new multimodal large language model, AlignGPT. In the pre-training phase, instead of treating all image-text pairs equally, we divide them into groups according to their degree of alignment. The model is then trained to learn representations of the different alignment levels. In the instruction-tuning phase, we adaptively combine these alignment-level representations to meet the dynamic alignment needs of different tasks. Extensive experimental results show that our model achieves competitive performance on 12 benchmarks.
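The two ideas in the abstract, grouping pairs by degree of alignment and adaptively combining alignment-level representations, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each image-text pair already has a scalar similarity score (the figure captions mention CLIP similarity), that the groups are equal-frequency bins, and that the combination is a softmax-weighted sum. The names `assign_alignment_levels`, `combine_level_embeddings`, and `gate_logits` are hypothetical.

```python
import math
from typing import List, Sequence


def assign_alignment_levels(scores: Sequence[float], num_levels: int) -> List[int]:
    """Map each pair's similarity score to a level in [0, num_levels).

    Assumption: equal-frequency binning by rank, so each level holds
    roughly the same number of pairs. The abstract does not state the rule.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be >= 1")
    # Indices of the pairs, ordered from least to most aligned.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    levels = [0] * len(scores)
    for rank, idx in enumerate(order):
        levels[idx] = min(rank * num_levels // len(scores), num_levels - 1)
    return levels


def combine_level_embeddings(
    level_embeddings: Sequence[Sequence[float]],
    gate_logits: Sequence[float],
) -> List[float]:
    """Softmax-weighted sum of per-level embeddings.

    Assumption: the weights come from one logit per level (hypothetical
    gate); how the paper derives the weights is not given in the abstract.
    """
    if len(level_embeddings) != len(gate_logits):
        raise ValueError("need exactly one gate logit per alignment level")
    # Subtract the max logit for a numerically stable softmax.
    peak = max(gate_logits)
    exps = [math.exp(g - peak) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(level_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, level_embeddings))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Made-up similarity scores, for illustration only.
    scores = [0.31, 0.12, 0.25, 0.18, 0.40, 0.22]
    print(assign_alignment_levels(scores, num_levels=3))
    print(combine_level_embeddings([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```

In this sketch the level index would select a per-level embedding during pre-training, while the weighted sum stands in for the task-dependent combination used during instruction tuning.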

Paper Structure (31 sections, 3 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: AlignGPT achieves competitive performance on a broad range of vision-language tasks compared with other generalist models. For readability, only MiniGPT-v2 and AlignGPT are shown.
  • Figure 2: Examples of image-text pairs in the pre-training dataset, where the numbers in each image represent the CLIP similarity.
  • Figure 3: The distribution of CLIP similarity scores between images and texts in the pre-training dataset.
  • Figure 4: The architecture of AlignGPT.
  • Figure 5: Performance comparison of AlignGPT (random) and AlignGPT on downstream datasets.
  • ...and 2 more figures