Vision Language Models as Values Detectors

Giulio Antonio Abbo, Tony Belpaeme

TL;DR

The paper examines whether vision-language models align with human perception when identifying the relevant elements in images of home scenes. It uses 12 diffusion-generated scenarios, 14 annotators, and five LLMs (GPT-4o and four LLaVA variants) to compare model outputs against human judgments. The findings show modest alignment: LLaVA 34B performs best but remains well below reliable agreement, which suggests model biases and a need for targeted fine-tuning and prompting to detect value-laden content. The study points to practical implications for social robotics, assistive technologies, and human-computer interaction, and emphasizes the importance of aligning AI interpretation with human values.

Abstract

Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.
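
The comparison described above can be illustrated with a minimal sketch. It is not the authors' method: it assumes each annotator and each model names exactly one element per image, that alignment is the fraction of annotators whose answer matches the model's by exact string match, and that a 95% interval comes from a percentile bootstrap over images. The function names (`alignment_score`, `bootstrap_ci`), the image keys, and the toy labels are all hypothetical.

```python
import random
from statistics import mean


def alignment_score(human_labels, model_label):
    """Fraction of annotators whose chosen element equals the model's choice.

    Assumption: answers are already normalized to comparable strings.
    """
    matches = sum(1 for label in human_labels if label == model_label)
    return matches / len(human_labels)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-image scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample images with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical toy data: three images, four annotators each.
annotations = {
    "img1": ["knife", "knife", "child", "knife"],
    "img2": ["stove", "pan", "stove", "stove"],
    "img3": ["dog", "dog", "sofa", "dog"],
}
model_answers = {"img1": "knife", "img2": "pan", "img3": "sofa"}

scores = [alignment_score(annotations[k], model_answers[k]) for k in annotations]
overall = mean(scores)
low, high = bootstrap_ci(scores)
print(f"alignment = {overall:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only a handful of images the interval is very wide, which is one reason a small image set supports only cautious conclusions about model rankings.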

Paper Structure (13 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Four of the twelve images used in the evaluation.
  • Figure 2: Comparison of the element of focus alignments, with the 95% confidence interval.