A New Dataset and Methodology for Malicious URL Classification

Ilan Schvartzman; Roei Sarussi; Maor Ashkenazi; Ido Kringel; Yaniv Tocker; Tal Furman Shohet

TL;DR

Malicious URL classification is challenged by data scarcity and rapid threat evolution. The authors present DeepURLBench, a large multi-class dataset (benign, phishing, malware) enriched with DNS responses and a time-based evaluation framework, and enhance URLNet with global lexical and DNS features using a multi-class loss. Experimental results show the augmented URLNet variants achieving higher AUC and recall while maintaining real-time performance, with temporal analysis revealing degradation that motivates frequent retraining. The work provides a practical, scalable dataset and modeling strategy that improves robustness and applicability of URL classification in evolving web security environments.

Abstract

Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. To address these gaps, we introduce DeepURLBench, a novel multi-class dataset for malicious URL classification that distinguishes between benign, phishing, and malicious URLs. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models compared to a standard binary classification approach. Additionally, we propose improvements to string-based URL classifiers and apply these enhancements to URLNet. Key among these is the integration of DNS-derived features, which enrich the model's capabilities and lead to notable performance gains while preserving real-time runtime efficiency, achieving an effective balance for cybersecurity applications.

Paper Structure (28 sections, 1 theorem, 9 figures, 4 tables)

Introduction
Related Work
Existing Datasets
Methods for Malicious URL Classification
DeepURLBench
Data Sources
Labels
Labeling Criteria
Detection Rate
Coverage
Labeling Criteria
DNS Response
Preprocessing and Data Curation
Temporal considerations
Reputation building time
...and 13 more sections

Key Result

Theorem 1

To qualify as real-time, a malicious URL classification method must sequentially classify 50 URLs in less than 0.5 s (that is, a single URL in less than 10 ms on average).
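
The criterion above can be checked with a simple timing harness. The following is a minimal sketch, not the authors' benchmark code: the function name `is_real_time`, the `classify` callable, and the injectable `clock` parameter are all hypothetical names introduced here for illustration. It assumes the classifier is invoked once per URL, sequentially, and that wall-clock time is the relevant measure.

```python
import time
from typing import Callable, Sequence, Tuple


def is_real_time(
    classify: Callable[[str], object],   # hypothetical per-URL classifier
    urls: Sequence[str],                 # batch to classify sequentially (50 in the criterion)
    budget_s: float = 0.5,               # total time budget from the criterion
    clock: Callable[[], float] = time.perf_counter,  # injectable for testing
) -> Tuple[bool, float]:
    """Return (meets_budget, elapsed_seconds) for a sequential run."""
    start = clock()
    for url in urls:
        classify(url)  # one URL at a time, no batching
    elapsed = clock() - start
    # The criterion is strict: the run must take less than the budget.
    return elapsed < budget_s, elapsed
```

With 50 URLs and the default budget, passing this check is equivalent to an average latency below 10 ms per URL. In practice one would probably repeat the measurement several times and warm up the model first, since a single run can be noisy.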

Figures (9)

Figure 1: Histogram depicting the percentage of URLs assigned a non-safe tag, by the number of vendors that flagged them.
Figure 2: Histogram of how often different vendors agree on the same verdict, out of all potentially non-safe URLs.
Figure 3: Histogram showing the number of URLs in the test set by their first appearance date in VirusTotal.
Figure 4: A browser screen with the developer tools panel. This shows a close-up view of requests initiated by the browser (marked in red), the total number of requests (marked in green) and the requests timeline (marked in blue). Note how a single web page request leads to additional URL requests, resulting in a substantial overall request count for the session.
Figure 5: Histogram depicting the number of requests initiated when browsing the top 1500 websites worldwide, as ranked by the Alexa top-million list.
...and 4 more figures
