Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models

Jeonghwan Kim, Heng Ji

TL;DR

This work examines why instruction-tuned LVLMs struggle with fine-grained visual concept recognition, identifying a modality gap between textual and visual processing. It introduces Finer, an attribute-centric FGVC benchmark built from six datasets, and AttrSeek, a prompting scheme to elicit discriminative attributes from LVLMs, complemented by a Finer training mixture to improve zero-shot FGVC. Through probing and projection analyses, the paper reveals that concept-related knowledge is embedded in model parameters but not effectively leveraged by the image pathway, and that visual information is degraded when projected into textual space. The results show substantial FGVC gains for GPT-4V and LLaVA-1.5 when using AttrSeek and the Finer mixture, suggesting a practical path to ground FGVC in LVLMs and improve explainability, while highlighting remaining challenges such as intra-concept variance and baseline diversity.

Abstract

Recent advances in instruction-tuned Large Vision-Language Models (LVLMs) have imbued the models with the ability to generate high-level, image-grounded explanations with ease. While this capability is largely attributed to the rich world knowledge contained within the Large Language Models (LLMs), our work reveals their shortcomings in fine-grained visual categorization (FGVC) across six different benchmark settings. Recent state-of-the-art LVLMs such as LLaVA-1.5, InstructBLIP and GPT-4V not only deteriorate severely in classification performance (e.g., an average drop of 65.58 in EM on Stanford Dogs for LLaVA-1.5), but also struggle to generate an accurate explanation with detailed attributes for the concept that appears in an input image, despite their capability to generate holistic image-level descriptions. In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap: they behave differently when given textual and visual inputs that correspond to the same concept, which prevents the image modality from leveraging the rich parametric knowledge within the LLMs. To further the community's efforts in this direction, we propose a multi-granularity, attribute-centric evaluation benchmark, Finer, which aims to establish a foundation for evaluating LVLMs' fine-grained visual comprehension ability and to provide significantly improved explainability.
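The abstract reports classification performance in EM without defining it here. As a rough illustration only, the sketch below assumes that EM denotes an exact-match score: the percentage of predictions that equal the gold label after light normalization. The function names and the normalization rule (lowercasing, stripping punctuation) are assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace (assumed rule)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def exact_match_score(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Return the percentage of predictions that exactly match their gold label."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


# Toy labels taken from the category examples in the figure captions.
print(exact_match_score(["Parrot.", "owl"], ["parrot", "SUV"]))
```

Under this reading, a drop in EM is simply the difference between two such scores, for example between coarse-grained and fine-grained prompting on the same images.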

Paper Structure (41 sections, 6 figures, 14 tables)

Figures (6)

  • Figure 1: Current state-of-the-art LVLMs exhibit strong zero-shot abilities on downstream tasks (e.g., image captioning, VQA, reasoning). However, when prompted to classify fine-grained concepts, most of them fail to distinguish the concepts at finer category levels. The fine-grained classification prompt is omitted here for brevity.
  • Figure 2: Zero-shot performance of state-of-the-art instruction-tuned LVLMs on fine-grained classification. All models classify well when prompted for superordinate-level (e.g., birds, cars) and coarse-grained categories (e.g., owls, SUVs), but their performance deteriorates significantly when they are prompted to assign more fine-grained categories to the same images. The gold tags for coarse- and fine-grained classification denote the use of gold labels from the parent category in the prompt.
  • Figure 3: Fine-grained classification pipeline. At each level, the output from the LVLM is injected into the next-level prompt. (1) The superordinate-level prompt is used to predict the highest-level category (e.g., bird). (2) The coarse-level prompt is then filled with the predicted output and fed back to the LVLM to generate the next output (e.g., parrot); (3a) and (4) follow the same steps. (3b) illustrates AttrSeek, a prompting scheme newly proposed in this work, in which the model is prompted to generate the visual attributes.
  • Figure 4: Model Performance on Text-only vs. Image-only Inputs. LLaVA-1.5 (7B and 13B), when provided only the textual information (7B-, 13B-Text) related to the ground-truth concept, outperforms the image-only input (7B-, 13B-Image) counterpart.
  • Figure 5: Linear Probing on Projected Image Embeddings. Classification accuracy (%) before and after projecting the image embeddings into the textual space.
  • ...and 1 more figure
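
The level-by-level prompting described in the Figure 3 caption can be sketched as a simple chain in which each answer is injected into the next prompt. This is a minimal sketch under stated assumptions, not the paper's implementation: the prompt wording in `PROMPTS`, the function `classify_hierarchically`, the `lvlm` callable interface, and the choice to pass the generated attributes into the final fine-grained prompt are all hypothetical. `stub_lvlm` returns canned answers so the example runs without a real model.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
PROMPTS = {
    "superordinate": "What is the general category of the object in the image?",
    "coarse": "The image shows a {parent}. What kind of {parent} is it?",
    "attributes": "The image shows a {parent}. List the visual attributes that distinguish it.",
    "fine": "The image shows a {parent}. What is its specific name?",
    "fine_with_attributes": (
        "The image shows a {parent} with these attributes: {attributes}. "
        "What is its specific name?"
    ),
}

# Assumed model interface: (image_path, prompt) -> answer text.
LVLM = Callable[[str, str], str]


def classify_hierarchically(image: str, lvlm: LVLM, use_attrseek: bool = True) -> Dict[str, str]:
    """Chain prompts from superordinate to fine level, injecting each output downstream."""
    out: Dict[str, str] = {}
    # (1) Superordinate level, e.g. "bird".
    out["superordinate"] = lvlm(image, PROMPTS["superordinate"]).strip()
    # (2) Coarse level, conditioned on the superordinate prediction, e.g. "parrot".
    out["coarse"] = lvlm(image, PROMPTS["coarse"].format(parent=out["superordinate"])).strip()
    if use_attrseek:
        # (3b) Attribute generation step, followed by a fine-level prompt that uses it.
        out["attributes"] = lvlm(image, PROMPTS["attributes"].format(parent=out["coarse"])).strip()
        prompt = PROMPTS["fine_with_attributes"].format(
            parent=out["coarse"], attributes=out["attributes"]
        )
    else:
        # (3a) Direct fine-level prompt without attributes.
        prompt = PROMPTS["fine"].format(parent=out["coarse"])
    out["fine"] = lvlm(image, prompt).strip()
    return out


def stub_lvlm(image: str, prompt: str) -> str:
    """Stand-in for a real LVLM call; returns canned placeholder answers."""
    if "general category" in prompt:
        return "bird"
    if "What kind of" in prompt:
        return "parrot"
    if "List the visual attributes" in prompt:
        return "placeholder attribute list"
    return "fine-grained label"


print(classify_hierarchically("example.jpg", stub_lvlm))
```

Replacing `stub_lvlm` with a function that calls an actual model keeps the chaining logic unchanged; setting `use_attrseek=False` skips the attribute step and gives the direct variant for comparison.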