Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

Xin Hu; Haomiao Ni; Yunbei Zhang; Jihun Hamm; Zechen Li; Zhengming Ding

TL;DR

This paper learns multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, which compensates for limited training examples. It also proposes a lightweight attention-based enhancement module that uses these embeddings to improve fine-grained object details in the visual tokens.

Abstract

Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects because such instances are scarce in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive when finetuning VLMs and do not fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLM. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM's attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects.
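The attention-based refinement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, no learned projections, and a residual update scaled by a hypothetical mixing weight `alpha`. The function name `refine_visual_tokens` is also hypothetical. Each visual token attends over the class embeddings and is nudged toward the embeddings it most resembles.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refine_visual_tokens(visual_tokens, class_embeddings, alpha=0.5):
    """Cross-attention from visual tokens (queries) to class embeddings
    (keys and values), followed by a residual update.

    Assumption: `alpha` and the projection-free form are illustrative
    choices, not details taken from the paper.
    """
    dim = len(class_embeddings[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for token in visual_tokens:
        # Attention weights of this token over all class embeddings.
        weights = softmax([dot(token, c) * scale for c in class_embeddings])
        # Weighted sum of class embeddings (the attended context).
        context = [
            sum(w * c[j] for w, c in zip(weights, class_embeddings))
            for j in range(dim)
        ]
        # Residual update keeps the original token information.
        refined.append([t + alpha * ctx for t, ctx in zip(token, context)])
    return refined
```

With `alpha=0` the tokens pass through unchanged, which makes it easy to check that the module is purely additive on top of a frozen VLM.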

Paper Structure (17 sections, 8 equations, 7 figures, 2 tables)

Introduction
Related Work
Proposed Method
Preliminaries
Motivation
Learning Multi-modal Class Embedding
Adaptive Semantic Augmentation
Visual-Language Alignment
Visual Token Refined Perception
Text Hints Injected Reasoning
Experiments
Experimental Setting
Comparison Results
Ablation Study
Interpretable Analysis
...and 2 more sections

Figures (7)

Figure 1: Comparison on rare object recognition: (a) shows that LLaVA tends to predict the "bollard" as a common object "traffic light", while (b) demonstrates that our method corrects LLaVA by predicting "bollard" and providing reasoning through visual enhancement and text prompt enrichment with object hints, both based on the learned multi-modal class embeddings.
Figure 2: Visual attention on the object "bollard" from the CODA-LM dataset. The attention weights across layers show that LLaVA-1.5-7B allocates less attention to the target object region. Brighter colors indicate higher attention weights.
Figure 3: Overview of the model framework, which consists of three main components: (a) a multi-modal class embedding learning module, which fuses object visual features with synonym-augmented text features; (b) a visual token enhancement module, which applies a cross-attention mechanism between class embeddings and image visual tokens in VLMs; and (c) a text hints injection module, which leverages the learned multi-modal class embeddings for object identification and enriches the text prompt with object hints.
Figure 4: Ablation study of visual refinement and text hints for LLaVA-1.5-7B on the CODA-LM dataset.
Figure 5: Comparison of different $k$ for LLaVA-7B on CODA-LM. "Detection Accuracy" is the top-$k$ detection accuracy of multi-modal class embeddings for objects. "VLM Accuracy" measures how VLMs recognize objects with/without our hints. "Trust Rate" is the ratio of VLMs' output that aligns with our hints.
...and 2 more figures
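The hint mechanism and the metrics named in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it assumes cosine similarity for ranking class embeddings, a hypothetical prompt template in `inject_hints`, and that an output "aligns" with the hints when the predicted class name is one of them. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_k_hints(region_feature, class_embeddings, k=2):
    """Rank class names by similarity to the region feature and keep k."""
    ranked = sorted(
        class_embeddings,
        key=lambda name: cosine(region_feature, class_embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


def inject_hints(question, hints):
    """Append object hints to the text prompt (hypothetical template)."""
    return f"{question} Hint: the object may be one of: {', '.join(hints)}."


def top_k_accuracy(predicted_hints, labels):
    """Fraction of samples whose true label appears among the top-k hints."""
    hits = sum(label in hints for hints, label in zip(predicted_hints, labels))
    return hits / len(labels)


def trust_rate(vlm_outputs, predicted_hints):
    """Fraction of VLM outputs that match one of the provided hints."""
    aligned = sum(out in hints for out, hints in zip(vlm_outputs, predicted_hints))
    return aligned / len(vlm_outputs)
```

A larger `k` makes it more likely that the true class is among the hints, but it also gives the VLM more candidates to choose from, which is the trade-off the Figure 5 comparison examines.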
