Effectiveness Assessment of Recent Large Vision-Language Models

Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan

TL;DR

Evaluated LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks, including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning.

Abstract

The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the effectiveness of these models in both specialized and general tasks warrants further investigation. This paper evaluates the competency of popular LVLMs in specialized and general tasks, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. The six tasks are salient/camouflaged/transparent object detection, polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs (MiniGPT-v2, LLaVA-1.5, and Shikra) on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations using these LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We examine this inadequacy in depth and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications.

Paper Structure (26 sections, 1 equation, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Overall diagram of our evaluation platform. We evaluate recent LVLMs in both specialized and general tasks using tailored prompts, with and without specifying object types. The specialized tasks include salient object detection (SOD), transparent object detection (TOD), camouflaged object detection (COD), polyp detection, skin lesion detection, and industrial anomaly detection (AD). The evaluation covers recognition (§ \ref{sec:perception}) and localization (§ \ref{sec:localization}) in these tasks, testing three recent open-source LVLMs (MiniGPT-v2 \cite{chen2023minigptv2}, LLaVA-1.5 \cite{liu2023improvedllava1point5}, and Shikra \cite{chen2023shikra}). In addition, empirical investigations are conducted on the COCO \cite{lin2014microsoftCOCO} dataset to reflect the capabilities of LVLMs in general tasks (§ \ref{sec:quali_other}), including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Examples are presented in each figure group, where "<...>" indicates a placeholder that can be replaced with other words/phrases in different tasks.
  • Figure 2: Responses of three LVLMs regarding the perception of camouflaged objects on negative samples. Incorrect responses are underlined in red and marked with crosses.
  • Figure 3: Detection and segmentation results of three LVLMs in six specialized tasks. The predicted bounding boxes and ground truth are marked in blue and green, respectively. From left to right in each scenario: detection (top) and segmentation (bottom) results of MiniGPT-v2 \cite{chen2023minigptv2}, LLaVA-1.5 \cite{liu2023improvedllava1point5}, and Shikra \cite{chen2023shikra}, followed by the upper-bound segmentation results (top) and the ground-truth masks (bottom).
  • Figure 4: Responses of three LVLMs regarding locating given objects and recognizing objects of specific types. Predicted bounding boxes and ground truth are marked in blue and green, respectively. From top to bottom: examples of salient object detection, transparent object detection, and camouflaged object detection. Incorrect responses are marked with red underlines and crosses.
  • Figure 5: Responses of three LVLMs regarding recognizing and locating the anomaly. Predicted bounding boxes and ground truth are marked in blue and green, respectively. The incorrect responses are marked with red underlines and crosses.
  • ...and 4 more figures