Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval
Janet Jenq, Hongda Shen
TL;DR
The paper addresses the vulnerability of vision-language models to typographic attacks in e-commerce search and turns that weakness into a practical technique: rendering product metadata directly onto product images to strengthen image-text alignment. It introduces three inference modes that use text-rendered images, an adaptive GetMaxFontSize algorithm for sizing the rendered text, and LLM-based text summarization to keep the rendered text compact. Across three proprietary datasets and multiple encoders, the approach yields consistent improvements in Acc@1 and Acc@3, with SigLIP and PE showing notable gains and CLIP variants also benefiting. The findings highlight the importance of typography, especially font size and center placement. The method adds minimal latency and requires no model training, making it well suited to real-world deployment in zero-shot multimodal product retrieval systems.
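The summary above names an adaptive GetMaxFontSize algorithm but does not spell out its steps. The following is a minimal sketch of one way such a routine could work, not the authors' implementation: it binary-searches for the largest font size at which word-wrapped text still fits inside a box. The function names, the fixed-width character model (`char_aspect`), and the `line_spacing` factor are all assumptions made for illustration; a real system would measure text with actual font metrics from an imaging library.

```python
import textwrap


def fits(text, font_size, box_w, box_h, char_aspect=0.6, line_spacing=1.2):
    """Return True if `text`, greedily word-wrapped, fits in a box_w x box_h box.

    Assumption: every character is `char_aspect * font_size` pixels wide and
    every line is `line_spacing * font_size` pixels tall (no real font metrics).
    """
    chars_per_line = int(box_w // (font_size * char_aspect))
    if chars_per_line < 1:
        return False
    lines = textwrap.wrap(text, width=chars_per_line)
    return len(lines) * font_size * line_spacing <= box_h


def get_max_font_size(text, box_w, box_h, lo=6, hi=200):
    """Binary-search the largest integer font size in [lo, hi] that fits.

    Returns None when the text does not fit even at the smallest size `lo`.
    """
    if not fits(text, lo, box_w, box_h):
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(text, mid, box_w, box_h):
            lo = mid       # mid fits: try larger sizes
        else:
            hi = mid - 1   # mid overflows: try smaller sizes
    return lo


if __name__ == "__main__":
    # Hypothetical product title rendered into a 224 x 224 pixel area.
    print(get_max_font_size("Nike Air Max 90", 224, 224))
```

The binary search assumes that a larger font never fits where a smaller one fails, which holds for this simple width model on typical short titles. Summarizing long metadata before rendering, as the TL;DR describes, keeps the text short enough that the resulting font size stays legible.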
Abstract
Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
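Retrieval accuracy of the kind reported here (Acc@1, Acc@3) can be illustrated with a short sketch. It assumes, since the text above does not define the metric precisely, that a query counts as correct when any of its top-k most similar gallery items carries the matching label, and that similarity is cosine similarity between embeddings. The function names and the toy two-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def acc_at_k(query_embs, query_labels, gallery_embs, gallery_labels, k):
    """Fraction of queries whose true label appears among the top-k gallery items."""
    hits = 0
    for q, label in zip(query_embs, query_labels):
        # Rank gallery indices from most to least similar to the query.
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda i: cosine(q, gallery_embs[i]),
            reverse=True,
        )
        top_labels = [gallery_labels[i] for i in ranked[:k]]
        if label in top_labels:
            hits += 1
    return hits / len(query_embs)


if __name__ == "__main__":
    # Toy embeddings; in practice these would come from a vision-language encoder.
    gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    gallery_labels = ["a", "b", "c"]
    queries = [[0.9, 0.1], [0.6, 0.8], [0.2, 1.0]]
    query_labels = ["a", "c", "c"]
    print(acc_at_k(queries, query_labels, gallery, gallery_labels, k=1))
    print(acc_at_k(queries, query_labels, gallery, gallery_labels, k=2))
```

In this toy setup the third query's nearest neighbor has the wrong label, so it counts as a miss at k=1 and a hit at k=2.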
