Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval

Janet Jenq, Hongda Shen

TL;DR

The paper addresses the vulnerability of vision–language models to typographic attacks in e-commerce search by reversing the attack rather than only defending against it: relevant product metadata is rendered directly onto product images to strengthen image–text alignment. It introduces three inference modes that leverage text-rendered images and uses an adaptive GetMaxFontSize algorithm, along with LLM-based text summarization, to keep inputs compact. Across three proprietary datasets (sneakers, handbags, and trading cards) and six encoders, the approach yields consistent improvements in Acc@1 and Acc@3, with SigLIP and PE showing notable gains and CLIP variants also benefiting. The findings highlight the importance of typography, especially font size and center placement. The method adds minimal latency and requires no model training, making it well suited to real-world deployment in zero-shot multimodal product retrieval systems.
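To make the idea of an adaptive font-size search concrete, here is a minimal sketch that finds the largest font size at which wrapped text still fits inside a box. It is not the paper's GetMaxFontSize algorithm, whose details are not given in this summary. The glyph-metric constants and the names `fits` and `max_font_size` are assumptions chosen for illustration, and a real renderer would measure text with its font library.

```python
import textwrap

# Assumption: a crude monospace-like model of glyph metrics. A real
# renderer would measure text with its font library instead.
CHAR_WIDTH_RATIO = 0.6   # hypothetical: glyph width as a fraction of font size
LINE_HEIGHT_RATIO = 1.2  # hypothetical: line height as a fraction of font size


def fits(text, font_size, box_width, box_height):
    """Return True if `text`, wrapped to the box width, fits in the box height."""
    chars_per_line = int(box_width // (CHAR_WIDTH_RATIO * font_size))
    if chars_per_line < 1:
        return False
    # textwrap breaks over-long words by default, so no line exceeds the width.
    lines = textwrap.wrap(text, width=chars_per_line)
    return len(lines) * LINE_HEIGHT_RATIO * font_size <= box_height


def max_font_size(text, box_width, box_height, lo=1, hi=512):
    """Binary-search the largest integer font size in [lo, hi] that fits.

    Returns 0 if no size in the range fits. The search relies on the fit
    test being monotone: once a size is too large, all larger sizes are too.
    """
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(text, mid, box_width, box_height):
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    return best


if __name__ == "__main__":
    title = "Air Jordan 1 Retro High OG"  # hypothetical listing title
    print(max_font_size(title, box_width=224, box_height=224))
```

Binary search keeps the number of fit checks logarithmic in the size range, which is one plausible way to keep rendering overhead low. Shortening the title first, as the summarization step does, lets the search settle on a larger size.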

Abstract

Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.

Paper Structure

This paper contains 12 sections, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: Visualization of the impact of title rendering on image representations, shown through changes in the attention sensitivity heatmap for the OpenAI CLIP ViT-Large model.
  • Figure 2: Overview of the proposed method, including two new inference modes that leverage text-rendered images for improved multimodal retrieval.
  • Figure 3: Illustration of typographic factors across the datasets used in our experiments: (a) Font Size Ratio, (b) Font Color, and (c) Rendered Text Location. Rendered text regions are highlighted with red boxes.
  • Figure 4: $Acc@1$ and $Acc@3$ across three typographic factors (Font Size Ratio, Font Color, and Rendered Text Location) for four models (SigLIP, SigLIP 2, PE, and CLIP) on the Handbags dataset. Rows correspond to the three typographic factors, while columns report $Acc@1$ and $Acc@3$ under inference settings B and C, respectively.
  • Figure 5: Example from the sneakers category showing the query and retrieved products under raw (top row) and title-rendered (bottom row) conditions. The top row shows the original query image and three retrieved product images. The bottom row shows the corresponding images with listing titles rendered using our proposed method. The correct product match is highlighted with a green box.
  • ...and 8 more figures
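The Acc@1 and Acc@3 metrics named in the Figure 4 caption can be illustrated with a short sketch. It assumes the usual top-k accuracy definition (a query counts as correct when its ground-truth product appears among the first k retrieved items). The summary does not spell out the paper's exact definition, and the function name and toy product IDs below are hypothetical.

```python
def acc_at_k(ranked_ids, true_ids, k):
    """Fraction of queries whose ground-truth ID is in the top-k results.

    ranked_ids: one ranked list of retrieved product IDs per query.
    true_ids:   the correct product ID for each query, in the same order.
    """
    if len(ranked_ids) != len(true_ids):
        raise ValueError("ranked_ids and true_ids must have the same length")
    if not true_ids:
        return 0.0
    hits = sum(
        1 for ranked, truth in zip(ranked_ids, true_ids) if truth in ranked[:k]
    )
    return hits / len(true_ids)


if __name__ == "__main__":
    # Toy retrieval results for four queries (hypothetical IDs).
    ranked = [
        ["p1", "p7", "p3"],  # correct item at rank 1
        ["p9", "p2", "p4"],  # correct item at rank 2
        ["p5", "p6", "p8"],  # correct item at rank 1
        ["p0", "p1", "p2"],  # correct item not retrieved
    ]
    truth = ["p1", "p2", "p5", "p3"]
    print(acc_at_k(ranked, truth, k=1))
    print(acc_at_k(ranked, truth, k=3))
```

Because Acc@3 counts every hit that Acc@1 counts, it is never lower than Acc@1 on the same rankings.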