Evaluating Attribute Comprehension in Large Vision-Language Models

Haiwen Zhang; Zixi Yang; Yuanzhi Liu; Xinran Wang; Zheqi He; Kongming Liang; Zhanyu Ma


TL;DR

This work addresses the challenge of fine-grained attribute comprehension in large vision-language models by introducing the Attribute Understanding Benchmark on the VAW dataset, evaluating attribute recognition and hierarchical understanding via ITC, ITM, and VQA. It formalizes attribute representations with a labeled vector $y \in \{-1,0,1\}^A$ and a hierarchical attribute DAG $\mathcal{T}$, and proposes post-hoc metrics (CmAP, CV) after hierarchy complementation. The results show that ITM captures finer attribute details than ITC and that attribute information embedded in training captions significantly improves attribute understanding, while image resolution plays a lesser role. Collectively, the findings offer design insights for future data collection and fine-tuning strategies to enhance compositional and hierarchical visual reasoning in vision-language systems.
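As a rough illustration of the formalization above, the following sketch stores per-attribute labels with three possible values and applies one plausible form of hierarchy complementation over a small attribute DAG. The numeric encoding (1 positive, -1 negative, 0 unlabeled), the attribute names, the `PARENTS` map, and the propagation rule are all assumptions made for illustration; the paper's actual complementation procedure may differ.

```python
# Assumed encoding; the summary above does not say which value means what.
POS, NEG, UNK = 1, -1, 0

# Hypothetical attribute DAG, stored as child -> list of parents.
PARENTS = {
    "crimson": ["red"],
    "red": ["colored"],
    "navy": ["blue"],
    "blue": ["colored"],
    "colored": [],
}


def ancestors(attr, parents):
    """Return every ancestor of attr reachable through the parent links."""
    seen = set()
    stack = list(parents.get(attr, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def complement(labels, parents):
    """Fill in unlabeled entries that the hierarchy implies.

    A positive attribute marks each unlabeled ancestor as positive.
    An attribute with a negative ancestor is marked negative if unlabeled.
    Labels that were already present are never overwritten.
    """
    out = dict(labels)
    # Upward pass: positives propagate to ancestors.
    for attr, value in labels.items():
        if value == POS:
            for anc in ancestors(attr, parents):
                if out.get(anc, UNK) == UNK:
                    out[anc] = POS
    # Downward pass: negatives propagate to descendants.
    for attr in parents:
        if out.get(attr, UNK) == UNK:
            if any(labels.get(a, UNK) == NEG for a in ancestors(attr, parents)):
                out[attr] = NEG
    return out


completed = complement({"crimson": POS, "blue": NEG}, PARENTS)
```

In this toy case, `crimson` being positive fills in `red` and `colored` as positive, and `blue` being negative fills in `navy` as negative. Metrics computed after complementation would then score the model against `completed` rather than the original sparse labels.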

Abstract

Large vision-language models have made promising progress on many downstream tasks. However, they still face many challenges in fine-grained visual understanding tasks, such as object attribute comprehension. In addition, although there have been growing efforts to evaluate large vision-language models, there is still a lack of in-depth study of attribute comprehension and of the vision-language fine-tuning process. In this paper, we propose to evaluate the attribute comprehension ability of large vision-language models from two perspectives: attribute recognition and attribute hierarchy understanding. We evaluate three vision-language interactions: visual question answering (VQA), image-text matching (ITM), and image-text cosine similarity (ITC). Furthermore, we explore the factors that affect attribute comprehension during fine-tuning. Through a series of quantitative and qualitative experiments, we report three main findings: (1) Large vision-language models possess good attribute recognition ability, but their hierarchical understanding ability is relatively limited. (2) Compared to ITC, ITM is better at capturing fine details, making it more suitable for attribute understanding tasks. (3) The attribute information in the captions used for fine-tuning plays a crucial role in attribute understanding. We hope this work can help guide future progress in fine-grained visual understanding of large vision-language models.
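To make the image-text cosine similarity interaction concrete, here is a minimal sketch that scores two candidate attribute prompts against one image embedding. The embedding values are made-up numbers and the helper names are hypothetical; in a real evaluation the vectors would come from a model's image and text encoders, which are not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def itc_scores(image_emb, prompt_embs):
    """Score each text prompt by its cosine similarity to the image."""
    return {
        prompt: cosine_similarity(image_emb, emb)
        for prompt, emb in prompt_embs.items()
    }


# Toy embeddings, invented for illustration only.
image_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "The material of the tray is metal.": [0.8, 0.2, 0.1],
    "The material of the tray is paper.": [0.1, 0.9, 0.3],
}

scores = itc_scores(image_emb, prompt_embs)
best_prompt = max(scores, key=scores.get)
```

ITC compares two independently computed embeddings, whereas ITM typically lets image and text features interact before producing a match score. That difference is one possible reading of why the abstract reports ITM as better at capturing fine details.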


Paper Structure (23 sections, 5 equations, 4 figures, 6 tables)


Introduction
Attribute Understanding Benchmark
Formulation
Evaluation Aspects
Attribute Recognition
Hierarchical Relationship Understanding
Evaluation Methodologies
Image-text Cosine Similarity (ITC)
Image-text Matching (ITM)
Visual Question Answering (VQA)
Evaluation Metrics
Comparison Methods
Experimental Results
Comparison with Close-Set Models
Analysis
...and 8 more sections

Figures (4)

Figure 1: Overview of the evaluation process. Original annotations of VAW are used for attribute recognition, while complementary annotations are used to evaluate hierarchical relationship understanding. We use the attribute tree [liang2023hierarchical] to perform complementation, and the inference results of ITM and ITC are presented as scores. For VQA, the results are presented as either 'Yes' or 'No'.
Figure 2: Attribute recognition accuracy of different models. The performance is assessed across eight attribute types within the VAW dataset.
Figure 3: Comparison of hierarchical relationship understanding between mPLUG and MiniGPT-4. Parent attributes are highlighted in red and child attributes in green.
Figure 4: Left: Grad-CAM [selvaraju2017grad] visualizations for BLIP ITM and ITC for each prompt. Scores are displayed below each picture. Prompts: (a) Negative: The material of the tray is paper. Positive: The material of the tray is metal. (b) Negative: The color of the beak is yellow. Positive: The shape of the beak is pointy. (c) Negative: The sport activity of the boy is playing football. Positive: The sport activity of the boy is skateboarding. Right: The distribution of positive and negative attribute prediction scores for BLIP [li2022blip].
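The prompts in the Figure 4 caption follow a regular "The <type> of the <object> is <attribute>." pattern, and Figure 1 notes that VQA results are read as 'Yes' or 'No'. The sketch below shows one way such prompts and answers might be handled. The function names and the question wording in `build_question` are hypothetical, and the parsing rule is an assumption, not the authors' procedure.

```python
def build_statement(attr_type, obj, attribute):
    """Statement prompt in the style of the Figure 4 caption."""
    return f"The {attr_type} of the {obj} is {attribute}."


def build_question(attr_type, obj, attribute):
    """Yes/no question for VQA; this wording is an assumption."""
    return f"Is the {attr_type} of the {obj} {attribute}? Answer yes or no."


def parse_yes_no(answer):
    """Map a free-form answer to True or False, or None if it is unclear."""
    cleaned = answer.strip().lower().replace(",", " ").replace(".", " ")
    words = cleaned.split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


positive = build_statement("material", "tray", "metal")
negative = build_statement("material", "tray", "paper")
question = build_question("material", "tray", "metal")
```

Returning `None` for unclear answers keeps off-format generations separate from genuine "No" predictions, which matters when accuracy is computed per attribute type.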
