A New Dataset and Methodology for Malicious URL Classification

Ilan Schvartzman, Roei Sarussi, Maor Ashkenazi, Ido Kringel, Yaniv Tocker, Tal Furman Shohet

TL;DR

Malicious URL classification is challenged by data scarcity and rapid threat evolution. The authors present DeepURLBench, a large multi-class dataset (benign, phishing, malware) enriched with DNS responses and a time-based evaluation framework, and enhance URLNet with global lexical and DNS features using a multi-class loss. Experimental results show the augmented URLNet variants achieving higher AUC and recall while maintaining real-time performance, with temporal analysis revealing degradation that motivates frequent retraining. The work provides a practical, scalable dataset and modeling strategy that improves robustness and applicability of URL classification in evolving web security environments.

Abstract

Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. To address these gaps, we introduce DeepURLBench, a novel multi-class dataset for malicious URL classification that distinguishes between benign, phishing and malicious URLs. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models compared to a standard binary classification approach. Additionally, we propose improvements to string-based URL classifiers and apply these enhancements to URLNet. Key among these is the integration of DNS-derived features, which enrich the model's capabilities and lead to notable performance gains while preserving real-time runtime efficiency, achieving an effective balance for cybersecurity applications.

Paper Structure (28 sections, 1 theorem, 9 figures, 4 tables)

Key Result

Theorem 1

For a malicious URL classification method to qualify as real-time, it must sequentially classify 50 URLs in less than 0.5 s (equivalently, a single URL in less than 10 ms on average).
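
The real-time criterion above can be checked empirically. The following is a minimal sketch, not the authors' benchmark: the function name `is_real_time`, the `classify` callable, and the injectable `clock` parameter are all hypothetical names introduced here for illustration. Only the two thresholds (50 URLs, 0.5 s) come from the statement itself.

```python
import time
from typing import Callable, Sequence

# Thresholds taken from the real-time criterion stated above.
BATCH_SIZE = 50        # number of URLs classified sequentially
BATCH_BUDGET_S = 0.5   # total time budget in seconds (10 ms per URL on average)


def is_real_time(
    classify: Callable[[str], int],
    urls: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> bool:
    """Return True if `classify` processes BATCH_SIZE URLs one by one within budget.

    `classify` is a placeholder for any URL classifier (hypothetical interface:
    one URL string in, one integer class label out). `clock` is injectable so
    the check can be tested deterministically.
    """
    if len(urls) < BATCH_SIZE:
        raise ValueError(f"need at least {BATCH_SIZE} URLs, got {len(urls)}")

    start = clock()
    for url in urls[:BATCH_SIZE]:
        classify(url)  # sequential, one URL at a time (no batching)
    elapsed = clock() - start

    return elapsed < BATCH_BUDGET_S


if __name__ == "__main__":
    # Toy stand-in classifier: labels every URL as class 0.
    sample_urls = [f"http://example.com/page/{i}" for i in range(BATCH_SIZE)]
    print(is_real_time(lambda url: 0, sample_urls))
```

A single timed run is noisy; in practice one would likely repeat the measurement several times and include any feature-extraction cost (for example, DNS lookups) inside `classify`, since the criterion concerns end-to-end latency.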

Figures (9)

  • Figure 1: Histogram of the percentage of URLs assigned a non-safe tag, by the number of vendors that flagged them.
  • Figure 2: Histogram of how often different vendors agree on the same verdict, across all potentially non-safe URLs.
  • Figure 3: Histogram showing the number of URLs in the test set by their first appearance date in VirusTotal.
  • Figure 4: A browser screen with the developer tools panel. This shows a close-up view of requests initiated by the browser (marked in red), the total number of requests (marked in green) and the requests timeline (marked in blue). Note how a single web page request leads to additional URL requests, resulting in a substantial overall request count for the session.
  • Figure 5: Histogram of the number of requests initiated when browsing the top 1500 websites worldwide, as ranked by the Alexa top-million list.
  • ...and 4 more figures
