Intriguing Properties of Large Language and Vision Models

Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi

TL;DR

Evaluating the most common LLVM families reveals several intriguing properties of current LLVMs concerning permutation invariance, robustness, math reasoning, alignment preservation, and importance, and suggests potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.

Abstract

Recently, large language and vision models (LLVMs) have received significant attention and development effort due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate the question along several aspects (permutation invariance, robustness, math reasoning, alignment preservation, and importance) by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on these observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
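
To make property (1) concrete, the following is a minimal sketch of what "randomly permuting the order of visual patch sequences" could look like at the token level. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `permute_visual_tokens` is hypothetical, and the 576 x 8 array is a synthetic stand-in for the projector output of a real model.

```python
import numpy as np


def permute_visual_tokens(tokens: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return the same visual tokens with their sequence order shuffled.

    tokens: array of shape (num_tokens, hidden_dim). Hypothetical stand-in
    for the visual tokens that a projector would hand to the LLM.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(tokens.shape[0])  # random reordering of positions
    return tokens[order]


# Synthetic example (assumption): 576 tokens with a toy hidden size of 8.
tokens = np.arange(576 * 8, dtype=np.float64).reshape(576, 8)
shuffled = permute_visual_tokens(tokens)

# The set of tokens is unchanged; only their positions differ. Any
# order-independent summary, such as the mean token, is therefore identical.
same_mean = np.allclose(tokens.mean(axis=0), shuffled.mean(axis=0))
```

In an actual experiment the shuffled sequence would be fed to the LLM in place of the original one, and the benchmark score compared before and after; that step depends on the specific model and is not shown here.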

Paper Structure (33 sections, 3 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: We demonstrate the extent to which group-wise visual tokens capture region-specific information (PIL) for LLaVA-1.5-7B on MMStar [chen2024we] and MME [fu2023mme]. Darker regions indicate areas where the model retains more localized information for those specific groups.
  • Figure 2: We present the performance across different grid sizes (2, 4, 8, 14) on the MMVP, MM-Vet, MathVista, and AI2D datasets, using three models: LLaVA-1.5, LLaVA-NeXT, and LLaVA-OneVision.
  • Figure 3: We present examples of shuffled images with different grid sizes (2, 4, 8, 14) derived from a MathVista dataset image. As the grid size increases, the chart image becomes more artistically styled.
  • Figure 4: We present performance on the GSM8K dataset using 8-shot Chain-of-Thought prompting. Additionally, we demonstrate that scaling up the instruction-tuning dataset enables LLVMs to solve text-only math reasoning problems more effectively.
  • Figure 5: We present examples of images (left) synthesized by SDXL-Lightning and (right) occluded using three methods: Random, Salient, and Non-Salient. The original images are from the MathVista and MME datasets. Occluded areas are marked in black to indicate zero pixel values.
  • ...and 4 more figures
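
The grid-shuffled images described in Figures 2 and 3 can be approximated with a short sketch: split the image into grid x grid tiles, permute the tiles, and reassemble. This is an assumption about the procedure, not the authors' implementation; `shuffle_image_grid` is a hypothetical helper, and the 336 x 336 input size is chosen only because it divides evenly by the grid sizes 2, 4, 8, and 14 listed in the captions.

```python
import numpy as np


def shuffle_image_grid(image: np.ndarray, grid: int, seed: int = 0) -> np.ndarray:
    """Shuffle an H x W x C image at the tile level on a grid x grid layout.

    Rows and columns that do not fit evenly into the grid are cropped.
    """
    h, w, c = image.shape
    th, tw = h // grid, w // grid               # tile height and width
    image = image[: th * grid, : tw * grid]     # crop to a multiple of the grid

    # (grid, th, grid, tw, C) -> (grid, grid, th, tw, C) -> (grid*grid, th, tw, C)
    tiles = image.reshape(grid, th, grid, tw, c).swapaxes(1, 2)
    tiles = tiles.reshape(grid * grid, th, tw, c)

    rng = np.random.default_rng(seed)
    tiles = tiles[rng.permutation(grid * grid)]  # reorder the tiles

    # Invert the tiling to get back an image-shaped array.
    out = tiles.reshape(grid, grid, th, tw, c).swapaxes(1, 2)
    return out.reshape(th * grid, tw * grid, c)


# Synthetic image (assumption): every pixel value is unique for easy checking.
image = np.arange(336 * 336 * 3, dtype=np.int64).reshape(336, 336, 3)
shuffled_views = {g: shuffle_image_grid(image, g) for g in (2, 4, 8, 14)}
```

Larger grid sizes produce smaller tiles, so local structure such as axis labels or digits is broken up more aggressively while the overall pixel content stays the same.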

Theorems & Definitions (1)

  • Definition 3.1: Importance Score