Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models

Yuhang Huang; Zihan Wu; Chongyang Gao; Jiawei Peng; Xu Yang

TL;DR

This research offers insight into the generation quality of LVLMs and proposes the Textual Retrieval-Augmented Classification (TRAC) framework, which supports a deeper analysis of fine-grained visual description generation.

Abstract

Large Vision-Language Models (LVLMs) are gaining traction for their ability to process and integrate visual and textual data. Despite their popularity, the capacity of LVLMs to generate precise, fine-grained textual descriptions has not been fully explored. This study addresses this gap by focusing on distinctiveness and fidelity, assessing how well models such as Open-Flamingo, IDEFICS, and MiniGPT-4 can distinguish between similar objects and accurately describe visual features. We propose the Textual Retrieval-Augmented Classification (TRAC) framework, which, by leveraging its generative capabilities, allows us to analyze fine-grained visual description generation in greater depth. This research provides insight into the generation quality of LVLMs and improves the understanding of multimodal language models. Notably, MiniGPT-4 generates better fine-grained descriptions than the other two models. The code is available at https://anonymous.4open.science/r/Explore_FGVDs-E277.

Paper Structure (21 sections, 3 equations, 9 figures, 6 tables)

Introduction
Related Works
Generating Visual Descriptions
Vision Language Models (VLMs)
Evaluation of Vision-Language Models (VLMs)
Method
Fine-Grained Visual Description Generation
Dual-Evaluation
Distinctiveness
Fidelity
Experiments
Datasets and Implementation Details
Distinctiveness Evaluation
Fidelity Evaluation
Qualitative Results
...and 6 more sections

Figures (9)

Figure 1: The caption produced by a smaller Vision Language Model (VLM) offers a broad overview of the image. In contrast, the fine-grained visual description (FGVD), generated by the Large Vision Language Model (LVLM) conditioned on both visual and linguistic cues, encompasses more nuanced details.
Figure 2: An overview of our framework for evaluating the quality of fine-grained visual descriptions (FGVDs) generated by Large Vision-Language Models (LVLMs). In the FGVD Generation phase (a), FGVDs are produced by conditioning on both visual and linguistic cues. Subsequently, we evaluate the quality of generated content in terms of its distinctiveness (b) and fidelity (c).
Figure 3: Results of LVLMs under different distinctiveness methods on five datasets.
Figure 4: The distinctiveness results at different $k$-values.
Figure 5: Human evaluation results assessing the fidelity of descriptions generated by LVLMs. Top: Average scores across various models. Bottom: Score distribution from 1 to 5 for each model.
...and 4 more figures
