Mitigating Bias in Machine Learning Models for Phishing Webpage Detection

Aditya Kulkarni, Vivek Balachandran, Dinil Mon Divakaran, Tamal Das

TL;DR

The paper addresses bias in the training data of phishing detectors, which arises from class imbalance and limited diversity. It proposes a Phishing Webpage Generation Tool that takes a legitimate URL and injects randomly selected content and visual phishing features to create synthetic phishing pages. The generated pages can balance datasets and support the evaluation of detectors trained on narrow data. The contribution is to position data generation as a means of assessing and improving generalization to zero-day phishing and of stress-testing existing solutions. In practice, the tool offers a scalable way to produce diverse phishing data for benchmarking and bias mitigation.

Abstract

The widespread accessibility of the Internet has led to a surge in online fraudulent activities, underscoring the necessity of shielding users' sensitive information from cybercriminals. Phishing, a well-known cyberattack, revolves around the creation of phishing webpages and the dissemination of corresponding URLs, aiming to deceive users into sharing their sensitive information, often for identity theft or financial gain. Various techniques are available for preemptively categorizing zero-day phishing URLs by distilling unique attributes and constructing predictive models. However, these existing techniques encounter unresolved issues. This proposal delves into persistent challenges within phishing detection solutions, particularly concentrated on the preliminary phase of assembling comprehensive datasets, and proposes a potential solution in the form of a tool engineered to alleviate bias in ML models. Such a tool can generate phishing webpages for any given set of legitimate URLs, infusing randomly selected content and visual-based phishing features. Furthermore, we contend that the tool holds the potential to assess the efficacy of existing phishing detection solutions, especially those trained on confined datasets.

Paper Structure (5 sections, 4 figures)

Figures (4)

  • Figure 1: Phishing Attacks from Jan 2019 to Dec 2022 (source: APWG [APWG_REPORT_4_2022])
  • Figure 2: Phishing Webpage Generation Tool
  • Figure 3: Legitimate Webpage
  • Figure 4: Generated Phishing Webpage