Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Yaqi Zhao, Yuanyang Yin, Lin Li, Mingan Lin, Victor Shea-Jay Huang, Siwei Chen, Weipeng Chen, Baoqun Yin, Zenan Zhou, Wentao Zhang

TL;DR

This work investigates how variations in VE representations influence LVLM comprehension, and proposes Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the LLM’s embedding space but also align with the LLM’s cognitive framework.

Abstract

Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as the vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with the LLM's cognitive framework, leading to a mismatch where visual features exceed the language model's interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data, i.e., images whose ambiguous visual representations challenge the VE's interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits the LVLM's capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate within the LLM's embedding space but also align with the LLM's cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight the essential role of cognitive alignment in advancing multimodal systems.

Paper Structure

This paper contains 48 sections, 6 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Instances of cognitive misalignment are systematically identified, even in advanced models like GPT-4o and Qwen2-VL. Although the image closely aligns with the description generated from the text-only prompt, both models fail to recognize the landmark when presented with the image. Text highlighted in green emphasizes details that are particularly relevant to the image.
  • Figure 2: Illustration of the dataset construction process for the Multi-granularity Landmark Dataset (MGLD), showing three stages: best image selection using CLIP similarity, data annotation with Q-A pairs, and multi-granularity data annotation.
  • Figure 3: t-SNE visualization of image features. Left: The HDS subset shows more dispersed representations for categories (e.g., "church"). Middle: The HSS subset shows distinct inter-class separations. Right: The LCS subset shows reduced intra-class variability and less distinct inter-class separations.
  • Figure 4: Category counts across subsets.
  • Figure 5: Comparative performance of HDS and BRS selection methods across different dataset sizes. Left: Accuracy (%) vs. data size for BRS and HDS, with two EECA points at 25k and 50k (best). Right: Percentage increase over baseline. VE-Known data outperforms Reference, demonstrating its effectiveness.
  • ...and 10 more figures
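
The first stage named in the Figure 2 caption, best image selection using CLIP similarity, can be illustrated with a minimal sketch. It assumes that image and text embeddings have already been computed by some CLIP-style encoder; the function names and the toy three-dimensional vectors below are hypothetical and are not taken from the paper. The sketch only shows the ranking step: score each candidate image against a text embedding by cosine similarity and keep the highest-scoring one.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_image(image_embeddings, text_embedding):
    """Return (index of the most similar image, list of all similarity scores).

    Hypothetical helper: the embeddings are assumed to be precomputed,
    e.g. by a CLIP image encoder and a CLIP text encoder.
    """
    scores = [cosine_similarity(e, text_embedding) for e in image_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy stand-ins for embeddings of three candidate images of one landmark
# and for the embedding of that landmark's text description.
candidates = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.6, 0.6, 0.2],
]
text = [1.0, 0.0, 0.0]

best_index, scores = select_best_image(candidates, text)
print(best_index, [round(s, 3) for s in scores])
```

In a real pipeline the toy vectors would be replaced by encoder outputs, and the selection criterion might differ from a plain argmax; the paper's exact procedure is not given in this summary.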