Characterizing Phishing Threats with Natural Language Processing

Michael C. Kotson, Alexia Schulz

TL;DR

Natural Language Processing techniques are used to investigate a specific real-world phishing campaign and to quantify attributes that indicate a targeted spear phishing attack. The analysis shows that individuals who were a primary focus of the campaign received CVs that are highly topically clustered.

Abstract

Spear phishing is a widespread concern in the modern network security landscape, but there are few metrics that measure the extent to which reconnaissance is performed on phishing targets. Spear phishing emails closely match the expectations of the recipient, based on details of their experiences and interests, making them a popular propagation vector for malware. In this work we use Natural Language Processing techniques to investigate a specific real-world phishing campaign and quantify attributes that indicate a targeted spear phishing attack. Our phishing campaign data sample comprises 596 emails - all containing a web bug and a Curriculum Vitae (CV) PDF attachment - sent to our institution from a foreign IP space. The campaign was found to exclusively target specific demographics within our institution. Performing a semantic similarity analysis between the senders' CV attachments and the recipients' LinkedIn profiles, we conclude with high statistical certainty ($p < 10^{-4}$) that the attachments contain targeted rather than randomly selected material. Latent Semantic Analysis further demonstrates that individuals who were a primary focus of the campaign received CVs that are highly topically clustered. These findings differentiate this campaign from one that leverages random spam.

Paper Structure

This paper contains 10 sections, 2 equations, 10 figures.

Figures (10)

  • Figure 1: The distribution of phishing emails received by different groups at MIT Lincoln Laboratory. Percentages are not adjusted to account for the size of each group. Group I is small, but groups G and H are underrepresented in the campaign.
  • Figure 2: The distribution of phishing emails received by employees sorted by job title. Titles are divided into tiers, with level 1 for employees with the least experience and responsibility, and higher levels for employees with more experience and responsibility. Notably, all recipients held research or leadership positions.
  • Figure 3: Similarity distributions for all possible phisher/target pairs (blue) and all observed pairs (translucent red). The median and standard error of the mean are $(5.03\pm 0.04)\times 10^{-3}$ for the blue histogram and $(6.65\pm 0.20)\times 10^{-3}$ for the red, indicating that the email pairs share significantly more similarities than expected from a random spamming attack.
  • Figure 4: The two-sample Kolmogorov-Smirnov test comparing the similarity distributions in Figure 3. Each curve is a cumulative probability distribution, and the black arrow marks their supremum difference of 0.224, which corresponds to a probability of less than 0.01% that the set of emails is a random sample of the set of all possible phisher/target pairs.
  • Figure 5: The similarity metric applied to the benchmark data sample. The gray histogram represents the distribution from an all-to-all comparison of the documents in the benchmark corpus, while the translucent red distribution shows document comparisons among Software Engineers only. The median and standard error of the mean are $(8.01\pm 0.05)\times 10^{-3}$ for the gray histogram and $(15.06\pm 0.20)\times 10^{-3}$ for the red.
  • ...and 5 more figures
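
The two-sample Kolmogorov-Smirnov test named in the Figure 4 caption can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the function name `ks_two_sample` is hypothetical, the p-value uses the standard asymptotic Kolmogorov series with a common small-sample correction (an assumption; the paper does not state which approximation was used), and the inputs are synthetic numbers rather than the paper's similarity scores.

```python
import bisect
import math


def ks_two_sample(sample_a, sample_b):
    """Return (D, p) for a two-sample Kolmogorov-Smirnov test.

    D is the supremum distance between the two empirical CDFs.
    p is an asymptotic approximation (assumed here, not taken from the paper).
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)

    # The supremum of |F_a(x) - F_b(x)| is attained at an observed data point,
    # so it is enough to evaluate both empirical CDFs at every observation.
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / n
        cdf_b = bisect.bisect_right(b, x) / m
        d = max(d, abs(cdf_a - cdf_b))

    # Effective sample size and the usual small-sample correction factor.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d

    # For very small lam the alternating series converges slowly and the
    # true tail probability is indistinguishable from 1.
    if lam < 0.2:
        return d, 1.0

    # Kolmogorov tail probability: 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


if __name__ == "__main__":
    # Synthetic stand-ins for the "all possible pairs" and "observed pairs"
    # similarity scores; these are NOT the paper's data.
    all_pairs = [i / 1000 for i in range(1000)]
    observed = [i / 1000 + 0.3 for i in range(1000)]
    d, p = ks_two_sample(all_pairs, observed)
    print(f"D = {d:.3f}, p = {p:.3g}")
```

A small p-value means the observed sample is unlikely to have been drawn from the same distribution as the reference sample, which is the sense in which the caption rules out a random selection of phisher/target pairs. In practice a library routine such as SciPy's `ks_2samp` would typically be used instead of a hand-rolled version.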