REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models

Jie Zhang, Zheng Yuan, Zhongqi Wang, Bei Yan, Sibo Wang, Xiangkui Cao, Zonghui Guo, Shiguang Shan, Xilin Chen

TL;DR

REVAL introduces a comprehensive, multi-dimensional benchmark for evaluating large vision-language models (LVLMs) on reliability and values. By aggregating 144K VQA samples and assessing 26 models, it shows that LVLMs excel at perceptual tasks and toxicity avoidance but remain vulnerable in adversarial robustness, privacy protection, and ethical reasoning. The framework combines Dysca-based perception and hallucination analyses with diverse attack types and privacy, safety, and morality assessments, yielding fine-grained, controllable insights. These findings highlight critical gaps and provide a scalable platform to steer the development of more secure, reliable, and ethically aligned LVLMs for real-world deployment.

Abstract

The rapid evolution of Large Vision-Language Models (LVLMs) has highlighted the necessity for comprehensive evaluation frameworks that assess these models across diverse dimensions. While existing benchmarks focus on specific aspects such as perceptual abilities, cognitive capabilities, and safety against adversarial attacks, they often lack the breadth and depth required to provide a holistic understanding of LVLMs' strengths and limitations. To address this gap, we introduce REVAL, a comprehensive benchmark designed to evaluate the REliability and VALue of LVLMs. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability, which assesses truthfulness (e.g., perceptual accuracy and hallucination tendencies) and robustness (e.g., resilience to adversarial attacks, typographic attacks, and image corruption), and Values, which evaluates ethical concerns (e.g., bias and moral understanding), safety issues (e.g., toxicity and jailbreak vulnerabilities), and privacy problems (e.g., privacy awareness and privacy leakage). We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro. Our findings reveal that while current LVLMs excel in perceptual tasks and toxicity avoidance, they exhibit significant vulnerabilities in adversarial scenarios, privacy preservation, and ethical reasoning. These insights underscore critical areas for future improvements, guiding the development of more secure, reliable, and ethically aligned LVLMs. REVAL provides a robust framework for researchers to systematically assess and compare LVLMs, fostering advancements in the field.
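
The two-section structure described in the abstract can be sketched as a small nested mapping. The section, dimension, and topic names below follow the abstract; the `aggregate` helper and its unweighted averaging are assumptions made purely for illustration and are not the paper's stated scoring rule.

```python
from statistics import mean

# Taxonomy as described in the abstract: section -> dimension -> example topics.
REVAL_TAXONOMY = {
    "Reliability": {
        "Truthfulness": ["perception", "hallucination"],
        "Robustness": ["adversarial attack", "typographic attack", "image corruption"],
    },
    "Values": {
        "Ethics": ["bias", "morality"],
        "Safety": ["toxicity", "jailbreak"],
        "Privacy": ["privacy awareness", "privacy leakage"],
    },
}


def aggregate(topic_scores):
    """Roll per-topic scores (0 to 100) up to dimension and section level.

    Hypothetical helper: unweighted means are an assumption, not the
    benchmark's documented method. Topics absent from topic_scores are skipped.
    """
    report = {}
    for section, dimensions in REVAL_TAXONOMY.items():
        dim_means = {}
        for dimension, topics in dimensions.items():
            known = [topic_scores[t] for t in topics if t in topic_scores]
            if known:
                dim_means[dimension] = mean(known)
        if dim_means:
            report[section] = {
                "dimensions": dim_means,
                "overall": mean(list(dim_means.values())),
            }
    return report


# Made-up scores for illustration only; these are not results from the paper.
example = aggregate({"perception": 80.0, "hallucination": 60.0, "toxicity": 90.0})
```

Keeping the taxonomy as plain data makes it easy to add or rename topics without touching the aggregation logic.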

Paper Structure

This paper contains 24 sections, 2 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: The framework of REVAL benchmark. Different colors represent distinct evaluation perspectives. For each topic, we provide an evaluation example along with the corresponding responses from several models. Each answer is preceded by a check mark or cross to indicate whether the answer matches the correct answer, and key clues that inform the judgment of correctness are highlighted in italics.
  • Figure 2: Radar charts for each topic evaluated in the reliability section. Each radar chart shows the results of the six models with the best overall performance in the results table. Different axes in the radar chart represent the various dimensions assessed under that topic. All results are normalized to the range 0 to 100, and higher values indicate better model performance.
  • Figure 3: Radar charts for each topic evaluated in the values section. Each radar chart shows the results of the six models with the best overall performance in the results table. Different axes in the radar chart represent the various dimensions assessed under that topic. All results are normalized to the range 0 to 100, and higher values indicate better model performance.
  • Figure 4: Visualization of the hierarchical clustering results for the 20 subtasks in perception.
  • Figure 5: Visualization of the hierarchical clustering results for the 51 animal categories.