The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

Yuwen Tan, Yuan Qing, Boqing Gong

Abstract

This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect because the VQA tasks improve the LLMs' hierarchical consistency more than the vision LLMs'. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess corresponding taxonomy knowledge.
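The evaluation idea in the abstract can be illustrated with a small sketch: one four-choice question per taxonomy level, plus a score that credits a sample only when every level is answered correctly. This section does not give the paper's exact definition of hierarchical consistency, so the strict all-levels-correct rule below is an assumption, and the toy taxonomy path, distractor pools, and function names are hypothetical.

```python
import random

# Hypothetical root-to-leaf taxonomy path (illustrative names only).
PATH = ["Vertebrate", "Fish", "Anemone Fish"]

# Hypothetical distractor pools, one per taxonomy level.
DISTRACTORS = {
    0: ["Invertebrate", "Plant", "Fungus"],
    1: ["Bird", "Mammal", "Reptile"],
    2: ["Goldfish", "Tiger Shark", "Sturgeon"],
}


def build_four_choice_tasks(path, distractors, seed=0):
    """Build one four-choice question per level of a taxonomy path."""
    rng = random.Random(seed)
    tasks = []
    for level, answer in enumerate(path):
        # One correct label plus three distractors from the same level.
        choices = [answer] + list(distractors[level][:3])
        rng.shuffle(choices)
        tasks.append({"level": level, "choices": choices, "answer": answer})
    return tasks


def hierarchical_consistency(predictions, paths):
    """Fraction of samples answered correctly at every level (assumed rule)."""
    if not paths:
        return 0.0
    consistent = sum(
        1 for pred, path in zip(predictions, paths) if list(pred) == list(path)
    )
    return consistent / len(paths)


def leaf_accuracy(predictions, paths):
    """Fraction of samples whose finest-grained (leaf) answer is correct."""
    if not paths:
        return 0.0
    correct = sum(1 for pred, path in zip(predictions, paths) if pred[-1] == path[-1])
    return correct / len(paths)


# Two samples: the second gets the leaf right but the coarsest level wrong.
preds = [
    ["Vertebrate", "Fish", "Anemone Fish"],
    ["Invertebrate", "Fish", "Anemone Fish"],
]
paths = [PATH, PATH]
print(hierarchical_consistency(preds, paths))  # 0.5
print(leaf_accuracy(preds, paths))             # 1.0
```

The second sample mirrors the abstract's example of recognizing Anemone Fish but not Vertebrate: leaf accuracy stays at 1.0 while the consistency score drops to 0.5.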

Paper Structure

This paper contains 51 sections, 9 equations, 13 figures, 26 tables.

Figures (13)

  • Figure 1: Left: Four-choice VQA tasks for evaluating VLLMs' hierarchical visual recognition. Right: A VLLM's answers (in red boxes) deviate from the ground truth path (green arrows), illustrating its lack of hierarchical consistency.
  • Figure 2: Prompt variants and their effects on VLLMs' hierarchical consistency (HCA) and fine-grained recognition $\mathrm{Acc}_\mathrm{leaf}$. Gen: general prompts; Hier: hierarchical prompts; +CoT: prompts with Chain-of-Thought reasoning; +Taxonomy: prompts that include an explicit taxonomy in JSON format. See the appendix for details and examples.
  • Figure 3: Qwen2.5-VL-7B vs. linearly probing the visual tokens at various stages of Qwen2.5-VL-7B on CUB-200 and iNat21-Plant.
  • Figure 4: Text HCA of different VLLMs' LLMs over the iNat21-Plant taxonomies of various depths.
  • Figure 5: Left: (Text) HCA difference between vision-language-tuned LLMs and original ones. Right: (Text) HCA of linearly probing different layers of Qwen2.5-VL-7B's LLM on iNat21-Plant.
  • ...and 8 more figures