Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models

A. Bavaresco, A. Testoni, R. Fernández

TL;DR

This study scrutinizes automatic ad understanding by contrasting the original Pitt Ads retrieval setup with a grounded adversarial evaluation. Using TRADE, a benchmark whose adversarial explanations fool models but not humans, it shows that zero-shot contrastive VLMs rely primarily on textual and visual grounding rather than deep multimodal reasoning. The authors demonstrate near-chance performance on TRADE across several VLMs and only modest gains on grounded control variants, while humans maintain high accuracy. The work highlights the need for robust evaluation protocols and suggests future work toward generative or differently calibrated assessments of multimodal ad understanding, with careful control of grounding cues.

Abstract

Image-based advertisements are complex multimodal stimuli that often contain unusual visual elements and figurative language. Previous research on automatic ad understanding has reported impressive zero-shot accuracy of contrastive vision-and-language models (VLMs) on an ad-explanation retrieval task. Here, we examine the original task setup and show that contrastive VLMs can solve it by exploiting grounding heuristics. To control for this confound, we introduce TRADE, a new evaluation test set with adversarial grounded explanations. While these explanations look implausible to humans, we show that they "fool" four different contrastive VLMs. Our findings highlight the need for an improved operationalisation of automatic ad understanding that truly evaluates VLMs' multimodal reasoning abilities. We make our code and TRADE available at https://github.com/dmg-illc/trade .
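
To make the retrieval setup concrete, the following is a minimal sketch of how a contrastive model could score an ad against candidate explanations: embed the image and each explanation, rank the candidates by cosine similarity, and count an instance as correct when the matching explanation ranks first. The function names (`cosine`, `rank_explanations`, `retrieval_accuracy`) and the toy vectors are assumptions made for illustration; they are not taken from the authors' code, and the paper's exact metric may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_explanations(image_emb, text_embs):
    """Rank candidate explanations by similarity to the ad image.

    Returns (order, sims): candidate indices from most to least similar,
    and the raw similarity of each candidate.
    """
    sims = [cosine(image_emb, t) for t in text_embs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order, sims

def retrieval_accuracy(instances):
    """Fraction of instances whose matching explanation is ranked first.

    Each instance is (image_emb, text_embs, positive_index). This top-1
    accuracy is an illustrative choice, not necessarily the paper's metric.
    """
    hits = 0
    for image_emb, text_embs, positive_index in instances:
        order, _ = rank_explanations(image_emb, text_embs)
        hits += int(order[0] == positive_index)
    return hits / len(instances)

# Toy embeddings (hypothetical, not produced by a real VLM).
image = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
order, sims = rank_explanations(image, candidates)
print(order)
```

Under this scoring scheme, a negative that mentions objects or text visible in the ad can receive a similarity close to that of the matching explanation, which is the kind of grounding shortcut the adversarial negatives in TRADE are meant to expose.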

Paper Structure (21 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: An example of the ad explanation retrieval task with the original setup vs. our new setup. The matching explanations are marked in italics. In the original setup, negatives are randomly sampled (5 out of 12 are shown for conciseness); in our setup, negatives are carefully curated to be textually and visually grounded in the ad but, at the same time, clearly incompatible with it. Brand names and logos are edited out in the examples present in this paper for presentation purposes, but are in fact visible in both task setups ([wbn] stands for "wrong brand name").
  • Figure 2: Ad explanations selected by human annotators vs. our tested models for one instance from TRADE. Italic indicates the matching explanation. Brands and logos are edited out in the paper examples for presentation purposes but are visible to models and human annotators.
  • Figure 3: Boxplots summarizing the distribution of grounding scores computed for positive explanations in TRADE. The blue dots indicate the scores for the positive explanations correctly selected by all VLMs. The object mention score is not included because its median coincides with the quartiles.
  • Figure 4: Examples from TRADE and TRADE-control, along with our transcription of the text (just for readability, not part of the dataset). Brands and logos are edited out in the paper examples for presentation purposes but are visible in TRADE.