TransURL: Improving malicious URL detection with multi-layer Transformer encoding and multi-scale pyramid features

Ruitong Liu, Yanbin Wang, Zhenhao Guo, Haitao Xu, Zhan Qin, Wenrui Ma, Fan Zhang

TL;DR

This paper addresses malicious URL detection, where standard Transformers struggle to capture local, character-level cues and URL hierarchy. It proposes TransURL, a character-aware Transformer built on CharBERT with three integrated modules: Multi-Layer Encoding, Multi-Scale Feature Learning, and Spatial Pyramid Attention. Experiments show state-of-the-art performance across class-imbalanced data, multi-class tasks, cross-dataset generalization, and adversarial robustness, with notable gains in F1 and AUC. The approach demonstrates strong practical value for real-world deployment, as evidenced by a case study and publicly available code and data.

Abstract

Advances in machine learning are improving the detection of malicious URLs. However, Transformers applied to URLs have difficulty extracting local information, character-level details, and structural relationships. To address these challenges, we propose TransURL, a novel approach to malicious URL detection. The method co-trains a character-aware Transformer with three feature modules: Multi-Layer Encoding, Multi-Scale Feature Learning, and Spatial Pyramid Attention. The character-aware Transformer lets TransURL extract embeddings carrying character-level information from URL token sequences, while the three modules fuse multi-layer Transformer encodings and capture multi-scale local details and structural relationships. We evaluate the method in several challenging scenarios: class-imbalanced learning, multi-class classification, cross-dataset testing, and adversarial sample attacks. Experimental results show a significant improvement over previous methods. For instance, it achieved a peak F1-score improvement of 40% in class-imbalanced scenarios and surpassed the best baseline by 14.13% in accuracy under adversarial attack. In a case study, our method correctly identified all 30 active malicious web pages, whereas two previous state-of-the-art methods missed 4 and 7 of them, respectively. The code and data are available at: https://github.com/Vul-det/TransURL/.
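
To give a rough sense of the pyramid-style pooling that a module such as Spatial Pyramid Attention relies on, the sketch below averages a variable-length sequence of per-character features over 1, 2, and 4 bins and concatenates the results into a fixed-length vector. This is not the authors' implementation: the function names `char_features` and `pyramid_pool`, the pyramid levels (1, 2, 4), and the hand-written per-character indicators (standing in for learned Transformer embeddings) are all assumptions made for illustration.

```python
from typing import List, Sequence


def char_features(url: str) -> List[List[float]]:
    """Toy per-character features: [is_letter, is_digit, is_other].

    Assumption: these stand in for learned embeddings, which a real model
    would take from its Transformer backbone.
    """
    return [
        [float(c.isalpha()), float(c.isdigit()), float(not c.isalnum())]
        for c in url
    ]


def pyramid_pool(
    features: Sequence[Sequence[float]],
    levels: Sequence[int] = (1, 2, 4),  # hypothetical pyramid levels
) -> List[float]:
    """Average-pool a sequence into sum(levels) bins per feature dimension."""
    n = len(features)
    if n == 0:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    pooled: List[float] = []
    for bins in levels:
        for b in range(bins):
            # floor/ceil boundaries keep every bin non-empty, even if n < bins
            start = (b * n) // bins
            end = ((b + 1) * n + bins - 1) // bins
            window = features[start:end]
            for d in range(dim):
                pooled.append(sum(vec[d] for vec in window) / len(window))
    return pooled


short = pyramid_pool(char_features("http://a.io"))
long = pyramid_pool(char_features("http://login.example.com/verify?id=12345"))
print(len(short), len(long))  # 21 21
```

Both URLs map to 21 values (3 features × 7 bins) despite their different lengths. The coarse level summarizes the whole string, while the finer levels keep some positional detail, such as where digits or separators concentrate.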

Paper Structure (20 sections, 21 equations, 7 figures, 8 tables)


Figures (7)

  • Figure 1: Some major parts in a URL.
  • Figure 2: A Network Structure Diagram of the BiGRU Module.
  • Figure 3: TransURL comprises four core components: CharBERT, the backbone network for learning character- and subword-level features, and the Encoder Feature Extractor, Multi-Scale Feature Learning, and Spatial Pyramid Attention modules for acquiring multi-order, multi-scale, attention-weighted features.
  • Figure 4: The architecture of the Heterogeneous Interaction Module.
  • Figure 5: Detection results of baseline methods and TransURL on GramBeddings dataset.
  • ...and 2 more figures