
VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

Zhe Hu, Yixiao Ren, Jing Li, Yu Yin

TL;DR

This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues, and examines the multimodal capabilities of large vision language models (VLMs) in leveraging human values to make decisions in a vision-depicted situation.

Abstract

Large vision language models (VLMs) have demonstrated significant potential for integration into daily life, making it crucial for them to incorporate human values when making decisions in real-world situations. This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most work on large VLMs focuses on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions in a vision-depicted situation. VIVA contains 1,240 images depicting diverse real-world situations, together with manually annotated decisions grounded in them. Given an image, the model should select the most appropriate action to address the situation and provide the relevant human values and the reason underlying the decision. Extensive experiments on VIVA reveal the limitations of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.
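
The action-selection task described above (choose one action from a set of candidates for a given image) can be scored with plain accuracy. The following is a minimal sketch in Python, not the authors' evaluation code: the `Sample` fields (`image_path`, `actions`, `answer`) and the function `action_accuracy` are hypothetical names chosen for illustration, and the actual VIVA data format may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One benchmark item (hypothetical schema, assumed for illustration)."""
    image_path: str      # path to the image depicting the situation
    actions: List[str]   # candidate actions shown to the model
    answer: int          # index of the annotated most appropriate action


def action_accuracy(samples: Sequence[Sample], predictions: Sequence[int]) -> float:
    """Fraction of samples where the predicted action index matches the annotation."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    if not samples:
        return 0.0
    correct = sum(1 for s, p in zip(samples, predictions) if p == s.answer)
    return correct / len(samples)
```

A caller would collect one predicted index per image from the model under test and pass the list to `action_accuracy` along with the annotated samples.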

Paper Structure (27 sections, 16 figures, 3 tables)

This paper contains 27 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Two vision-grounded decision-making examples with human values. The best decision is in the blue box.
  • Figure 2: Instances of different tasks of our dataset. Our tasks assess the explicit actions taken and the underlying values and reason behind those actions.
  • Figure 3: The VIVA benchmark construction pipeline overview. The process begins with brainstorming diverse textual situation descriptions leveraging GPT. Then, we gather images corresponding to the situations described using image searches. After that, human annotators collaborate with GPT to write and verify the components for each task to ensure overall data quality.
  • Figure 4: Categories of situations covered by our dataset. Illustrations of each category are provided in the appendix.
  • Figure 5: Model accuracy (y-axis) on Level-1 action selection with the incorporation of oracle and predicted values.
  • ...and 11 more figures