Visual Persuasion: What Influences Decisions of Vision-Language Models?

Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

TL;DR

The authors argue that probing vision-language models with systematically edited images offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered only implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.

Abstract

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.
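
To make "selection probability" in head-to-head comparisons concrete, here is a minimal sketch that estimates how often an edited image is chosen over its original and attaches a 95% Wilson score interval. It is an illustration under stated assumptions, not the authors' analysis: the paper reports estimated marginal means, which presumably come from a fitted statistical model, and the function name `choice_probability` and the boolean trial encoding are hypothetical.

```python
import math
from typing import Sequence, Tuple


def choice_probability(
    chose_edited: Sequence[bool], z: float = 1.96
) -> Tuple[float, float, float]:
    """Estimate P(edited image is chosen) from head-to-head trials.

    Each entry of `chose_edited` is True when the model picked the edited
    image over the original in one trial (hypothetical encoding).
    Returns (point estimate, lower bound, upper bound) using the Wilson
    score interval; z = 1.96 gives roughly 95% coverage.
    """
    n = len(chose_edited)
    if n == 0:
        raise ValueError("need at least one trial")
    wins = sum(1 for chosen in chose_edited if chosen)
    p_hat = wins / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return p_hat, centre - half_width, centre + half_width


# Toy data: the edited image wins 70 of 100 comparisons.
trials = [True] * 70 + [False] * 30
p_hat, lower, upper = choice_probability(trials)
print(f"P(edited chosen) = {p_hat:.2f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the lower bound stays above 0.5, the edit shifted choices beyond what indifference between the two images would explain. In practice one would also counterbalance the left/right position of the two images across trials, since position bias can mimic a preference.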

Paper Structure

This paper contains 86 sections, 5 equations, 17 figures, 8 tables, and 4 algorithms.

Figures (17)

  • Figure 1: Simplified overview of the iterative visual optimization process through feedback-driven prompt refinement. An original image is progressively improved over $K$ rounds. In each iteration, judges provide feedback with possible improvements, and an LLM uses the feedback to generate editing instructions. These instructions are applied with an image generation model to produce the candidate for the next round. The process stops after a certain number of rounds or if an equilibrium is reached. More examples in \ref{fig:examples}.
  • Figure 2: Estimated marginal mean probability of choice by task (columns) $\times$ optimization method (rows) and optimization stage (x-axis: original image, zero-shot modified, and final after optimization). Results are averaged across all VLMs.
  • Figure 3: Estimated marginal mean probability of choice for final optimized images produced by different optimization methods in head-to-head comparisons. Results are averaged across all VLMs with error bars showing 95% confidence intervals.
  • Figure 4: Effect of image normalization for $\kappa$ passes on the estimated probability of choosing the original vs. final variants.
  • Figure 5: Effect of image normalization on human choices, compared with original vs. final trials. After 3 passes, the probability of choosing the final optimized image decreases.
  • ...and 12 more figures
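
The loop described in the Figure 1 caption can be sketched roughly as follows. This is a schematic under stated assumptions, not the paper's algorithm: all four callables (`judge_feedback`, `propose_instruction`, `apply_edit`, `prefers_candidate`) are hypothetical stand-ins for the judges, the instruction-writing LLM, the image generation model, and a head-to-head comparison, and "equilibrium" is interpreted here as "the new candidate is no longer preferred over the current image", which the caption does not specify.

```python
from typing import Callable, List, Tuple, TypeVar

Image = TypeVar("Image")


def optimize_image(
    original: Image,
    judge_feedback: Callable[[Image], List[str]],
    propose_instruction: Callable[[List[str]], str],
    apply_edit: Callable[[Image, str], Image],
    prefers_candidate: Callable[[Image, Image], bool],
    max_rounds: int = 5,
) -> Tuple[Image, List[Image]]:
    """Run up to `max_rounds` rounds of feedback-driven image editing.

    Returns the final image and the list of accepted images, starting
    with the original.
    """
    current = original
    history = [original]
    for _ in range(max_rounds):
        feedback = judge_feedback(current)            # judges suggest improvements
        instruction = propose_instruction(feedback)   # LLM writes an edit instruction
        candidate = apply_edit(current, instruction)  # image model applies the edit
        if not prefers_candidate(current, candidate):
            break  # assumed equilibrium: the edit no longer wins
        current = candidate
        history.append(current)
    return current, history


# Toy stand-ins: an "image" is an integer quality score capped at 3.
def toy_feedback(image: int) -> List[str]:
    return ["improve lighting"]


def toy_instruction(feedback: List[str]) -> str:
    return "; ".join(feedback)


def toy_edit(image: int, instruction: str) -> int:
    return min(image + 1, 3)


def toy_prefers(current: int, candidate: int) -> bool:
    return candidate > current


final, history = optimize_image(
    0, toy_feedback, toy_instruction, toy_edit, toy_prefers, max_rounds=10
)
print(final, history)
```

Passing the components in as callables keeps the control flow separate from any particular model or API, so the same loop can be exercised with toy functions as above or with real judges and an image editor.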

Theorems & Definitions (1)

  • Definition 3.1: Identity maintenance