PhishLang: A Real-Time, Fully Client-Side Phishing Detection Framework Using MobileBERT

Sayak Saha Roy, Shirin Nilizadeh

TL;DR

PhishLang presents a real-time, fully client-side phishing detector that leverages MobileBERT to analyze contextual signals from both a website's URL and source code. By parsing actionable HTML tags and combining URL and source code features in a lightweight ensemble, it achieves high accuracy with minimal resource demands while preserving user privacy. The framework is evaluated against a large ground-truth dataset and real-world live streams, showing strong performance and robustness to adversarial, problem-space evasion tactics, aided by parser-based patches and adversarial training. Its open-source Chromium extension demonstrates practical, private protection without reliance on external blocklists, making it a scalable addition to existing anti-phishing defenses.
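As a rough illustration of what "parsing actionable HTML tags" could look like, the sketch below keeps only tags a user can interact with and drops everything else before the text is passed to a language model. The tag list `ACTIONABLE_TAGS`, the attribute list `KEPT_ATTRS`, and the function `parse_actionable` are assumptions made for this example; they are not taken from the PhishLang source code, whose parser may select different tags and attributes.

```python
from html.parser import HTMLParser
from typing import List

# Hypothetical selection of "actionable" tags and attributes (an assumption
# for illustration; the paper's parser may use a different set).
ACTIONABLE_TAGS = {"form", "input", "button", "a", "title", "script", "iframe"}
KEPT_ATTRS = ("href", "action", "src", "type", "name", "placeholder")


class ActionableTagParser(HTMLParser):
    """Collects a compact text token for every actionable start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in ACTIONABLE_TAGS:
            return
        kept = [f'{k}="{v}"' for k, v in attrs if k in KEPT_ATTRS and v]
        self.tokens.append(" ".join([tag] + kept))


def parse_actionable(html: str) -> List[str]:
    """Return one token per actionable tag, in document order."""
    parser = ActionableTagParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    page = (
        '<div><p>Welcome back</p>'
        '<form action="/login.php">'
        '<input type="password" name="pw">'
        '<button>Sign in</button></form></div>'
    )
    for token in parse_actionable(page):
        print(token)
```

Reducing a page to a short sequence like this keeps the model input small, which matters for a detector that has to run inside the browser.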

Abstract

In this paper, we introduce PhishLang, the first fully client-side anti-phishing framework built on a lightweight ensemble of language models that analyze the contextual features of a website's source code and URL. Traditional heuristic and machine learning approaches rely on static features and struggle to adapt to evolving threats, while deep learning models are computationally intensive. Our approach instead uses MobileBERT, a fast and memory-efficient variant of the BERT architecture, to capture nuanced features indicative of phishing attacks. To further enhance detection accuracy, PhishLang employs a multi-modal ensemble that combines the URL and Source detection models. This architecture improves robustness by allowing one model to compensate when the other fails, and by resolving cases where both models produce ambiguous inferences. As a result, PhishLang detects both regular and evasive phishing threats, including zero-day attacks, and outperforms popular anti-phishing tools. It operates without relying on external blocklists and safeguards user privacy by keeping browser history entirely local and unshared. We release PhishLang as a Chromium browser extension and also open-source the framework to aid the research community.
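The ensemble idea described in the abstract can be sketched as a small decision rule over the two model outputs. This is a minimal sketch under stated assumptions: the function `ensemble_verdict`, the thresholds `high` and `low`, and the averaging fallback are all hypothetical choices made for illustration, and the abstract does not specify how PhishLang actually combines the URL and Source model scores.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    label: str    # "phishing" or "benign"
    score: float  # combined phishing probability in [0, 1]


def ensemble_verdict(url_prob: float, source_prob: float,
                     high: float = 0.8, low: float = 0.2) -> Verdict:
    """Combine two phishing probabilities (hypothetical rule, not the paper's).

    url_prob and source_prob stand in for the outputs of the URL model and
    the Source model; high and low are assumed confidence thresholds.
    """
    for p in (url_prob, source_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")

    # One confident model compensates for a miss by the other.
    if max(url_prob, source_prob) >= high:
        return Verdict("phishing", max(url_prob, source_prob))

    # Both models are confident the page is benign.
    if url_prob <= low and source_prob <= low:
        return Verdict("benign", max(url_prob, source_prob))

    # Ambiguous case: fall back to the mean of the two scores.
    avg = (url_prob + source_prob) / 2
    return Verdict("phishing" if avg >= 0.5 else "benign", avg)


if __name__ == "__main__":
    print(ensemble_verdict(0.95, 0.10))  # URL model is confident
    print(ensemble_verdict(0.05, 0.10))  # both models say benign
    print(ensemble_verdict(0.60, 0.55))  # ambiguous, resolved by the mean
```

The rule above favors recall: a single confident model is enough to flag a page. A deployment that is more sensitive to false positives could require agreement between the two models instead.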

Paper Structure

This paper contains 21 sections, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: Example of a parsed snippet for a phishing website, with the parsed features mapped. The parsing focuses on extracting actionable features from the website.
  • Figure 2: Distribution of libraries/frameworks found in (A) phishing websites and (B) benign websites, and of JS main function calls found in (C) phishing websites.
  • Figure 3: Example of an Instagram attack where the seemingly benign landing page (which was missed by PhishLang) led to a phishing page.
  • Figure 4: Example of a non-English (Japanese) false negative sample
  • Figure 5: An example of a false positive with poor layout
  • ...and 3 more figures