Continuous Multi-Task Pre-training for Malicious URL Detection and Webpage Classification

Yujie Li, Yiwei Liu, Peiyue Li, Yifan Jia, Yanbin Wang

TL;DR

This work introduces urlBERT, a Transformer-based URL encoder pretrained on a massive unlabeled URL corpus using five specialized self-supervised tasks to capture URL structure, sequence, and semantics. A grouped sequential learning scheme organizes the pretraining tasks into two stages, and a two-stage fine-tuning procedure enables stable, efficient adaptation to downstream URL classification in both single-task and multi-task settings. Across phishing URL detection, advertising URL detection, and webpage topic classification, urlBERT consistently outperforms standard pretrained models and remains robust across data scales and task settings, with multi-task performance close to single-task performance. The approach offers a scalable, URL-native backbone for diverse security and web-content analysis applications; compression and incremental-learning techniques are left as future work toward efficient edge deployment.

Abstract

Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. In recent years, extensive research has explored using BERT or similar language models to replace traditional machine learning methods for detecting malicious URLs and classifying webpages. While previous studies show promising results, they often apply existing language models to these tasks without accounting for the inherent differences in domain data (e.g., URLs being loosely structured and semantically sparse compared to text), leaving room for performance improvement. Furthermore, current approaches focus on single tasks and have not been tested in multi-task scenarios. To address these challenges, we propose urlBERT, a pre-trained URL encoder that uses a Transformer to encode foundational knowledge from billions of unlabeled URLs. To this end, we use five unsupervised pre-training tasks that capture multi-level lexical, syntactic, and semantic information in URLs and generate contrastive and adversarial representations. To avoid competition and interference among the pre-training tasks, we propose a grouped sequential learning method that ensures effective training across multiple tasks. Finally, we use a two-stage fine-tuning approach to improve the training stability and efficiency of the task model. To assess the multitasking potential of urlBERT, we fine-tune the task model in both single-task and multi-task modes. The former creates a classification model for a single task, while the latter builds a single classification model capable of handling multiple tasks. We evaluate urlBERT on three downstream tasks: phishing URL detection, advertising URL detection, and webpage classification. The results demonstrate that urlBERT outperforms standard pre-trained models and that its multi-task mode can meet real-world multitasking demands.

Paper Structure

This paper contains 29 sections, 8 equations, 3 figures, 7 tables, 3 algorithms.

Figures (3)

  • Figure 1: The pre-training task framework of urlBERT. In the first phase, the model is trained on the RSTD task, which involves detecting shuffled and replaced tokens. In the second phase, the model is trained on masked language modeling, virtual adversarial training, and contrastive learning.
  • Figure 2: ROC curves for urlBERT trained on training sets of different sizes.
  • Figure 3: Performance comparison across training epochs for urlBERT, BiLSTM, and TextCNN. Metrics include Accuracy, Precision, and F1-Score. All models are evaluated on the phishing URL detection task.
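
The first-phase task named in the Figure 1 caption, detecting shuffled and replaced tokens, can be illustrated with a small data-corruption sketch. This is not the authors' implementation: the regex tokenizer, the three-class label scheme (0 = original, 1 = replaced, 2 = shuffled), the corruption probabilities, and the names `tokenize_url` and `corrupt_tokens` are all assumptions made for illustration, and the paper's actual formulation may differ.

```python
import random
import re


def tokenize_url(url):
    """Hypothetical tokenizer: alphanumeric runs and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def corrupt_tokens(tokens, vocab, p_replace=0.15, p_shuffle=0.15, seed=None):
    """Build one training example for a shuffled/replaced token detector.

    Returns (corrupted, labels), where labels[i] is
    0 if position i still holds its original token,
    1 if it was replaced by a different vocabulary token,
    2 if it received a different token moved from another position.
    The label scheme and probabilities are assumptions, not from the paper.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [0] * len(tokens)

    # Replacement: swap in a different token drawn from the vocabulary.
    for i, tok in enumerate(tokens):
        if rng.random() < p_replace:
            candidates = [v for v in vocab if v != tok]
            if candidates:
                corrupted[i] = rng.choice(candidates)
                labels[i] = 1

    # Shuffling: permute the tokens of some untouched positions among themselves.
    picked = [i for i in range(len(tokens))
              if labels[i] == 0 and rng.random() < p_shuffle]
    if len(picked) >= 2:
        permuted = picked[:]
        rng.shuffle(permuted)
        for src, dst in zip(picked, permuted):
            # Only mark a position as shuffled if its token actually changed.
            if tokens[src] != tokens[dst]:
                corrupted[dst] = tokens[src]
                labels[dst] = 2

    return corrupted, labels


if __name__ == "__main__":
    toks = tokenize_url("http://example.com/login?id=1")
    vocab = sorted(set(toks) | {"secure", "account", "www"})
    noisy, labels = corrupt_tokens(toks, vocab, seed=0)
    for original, changed, label in zip(toks, noisy, labels):
        print(f"{original!r:12} {changed!r:12} {label}")
```

A token-level classifier on top of the encoder would then be trained to predict `labels` from `noisy`. Marking a position as corrupted only when its token actually changed keeps the labels consistent with what the model can observe.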