Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection
Sachin Shukla, Omid Mirzaei
TL;DR
Threat actors frequently reuse email kits, and small content changes are often enough to slip a reused layout past rule-based and keyword-driven detectors. The authors introduce Pisco, a visual similarity pipeline that renders emails as screenshots, encodes them as CLIP embeddings, and retrieves visually similar emails from a vector database to classify new messages. In a month-long study, they identified over 20,000 clusters of visually similar emails among 116k emails, with many clusters persisting across weeks or months, indicating broad reuse of email kits. The work demonstrates that drawing on the visual appearance of historical emails can enhance protection beyond textual features, and it suggests extensions to threat intelligence, labeling, and campaign tracking.
Abstract
In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns, either through rule-based detection systems or through machine learning (ML) solutions that rely on keywords. However, both kinds of systems are susceptible to evasion techniques and zero-day attacks that can be mounted at low cost. As a result, an email that bypassed the defense system once can do so again in the following days, even after rules are updated or ML models are retrained. Repeated failures to detect emails whose layout resembles previously undetected spam are concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to real-world samples received from different sources. Our results show that email kits are reused extensively and that visually similar emails are sent to our customers at various time intervals. This method could therefore be very helpful in situations where detection engines that rely on textual features and keywords are bypassed, which our observations show happens frequently.
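To make the retrieval idea concrete, the following is a minimal sketch of similarity-based classification, not the authors' implementation. It assumes each email has already been rendered and encoded as an embedding vector; the three-dimensional toy vectors stand in for real CLIP embeddings, the in-memory list stands in for a vector database, and the function names, the `k` value, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_visual_similarity(query, index, k=3, threshold=0.9):
    """Label a new email by majority vote over its nearest stored neighbours.

    `index` is a list of (embedding, label) pairs for previously seen emails.
    Only neighbours whose similarity reaches `threshold` may vote; if none
    qualify, the function returns (None, []) so other detectors can decide.
    """
    scored = sorted(
        ((cosine_similarity(query, emb), label) for emb, label in index),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = [(score, label) for score, label in scored[:k] if score >= threshold]
    if not neighbours:
        return None, []
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0], neighbours


# Toy index: hypothetical embeddings of previously labeled emails.
index = [
    ([0.90, 0.10, 0.00], "malicious"),
    ([0.88, 0.12, 0.02], "malicious"),
    ([0.00, 0.20, 0.95], "benign"),
]

# A new email whose layout closely matches the two stored malicious samples.
label, neighbours = classify_by_visual_similarity([0.91, 0.09, 0.01], index)
print(label, len(neighbours))
```

In a real deployment the linear scan would be replaced by an approximate nearest-neighbour query against a vector database, and the threshold would need tuning on labeled data; the sketch only shows why a message with changed text but a reused layout can still match earlier samples.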
