PMANet: Malicious URL detection via post-trained language model guided multi-level feature attention network
Ruitong Liu, Yanbin Wang, Haitao Xu, Zhan Qin, Fan Zhang, Yiwei Liu, Zheng Cao
TL;DR
PMANet tackles malicious URL detection by adapting a character-aware pre-trained language model to the URL domain through unsupervised post-training. It introduces a multi-level feature attention framework that combines multi-order feature extraction, layer-aware attention, and spatial pyramid pooling to fuse local and global cues across subword and character representations. The approach achieves state-of-the-art results on balanced and imbalanced datasets, multi-class classification, cross-dataset tests, and adversarial attacks, including an AUC of 0.9941 and correct detection of all 20 malicious URLs in a case study. The work demonstrates efficient transfer learning and strong generalization, and its code is publicly available.
Abstract
The proliferation of malicious URLs has made their detection crucial for enhancing network security. While pre-trained language models offer promise, existing methods struggle with domain-specific adaptability, character-level information, and local-global encoding integration. To address these challenges, we propose PMANet, a pre-trained language model-guided multi-level feature attention network. PMANet employs a post-training process with three self-supervised objectives (masked language modeling, noisy language modeling, and domain discrimination) to capture subword and character-level information effectively. It also includes a hierarchical representation module and a dynamic layer-wise attention mechanism for extracting features from low to high levels. Additionally, spatial pyramid pooling integrates local and global features. Experiments across diverse scenarios, including small-scale data, class imbalance, and adversarial attacks, demonstrate PMANet's superiority over state-of-the-art models, achieving an AUC of 0.9941 and correctly detecting all 20 malicious URLs in a case study. Code and data are available at https://github.com/Alixyvtte/Malicious-URL-Detection-PMANet.
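To make the last two components of the abstract concrete, the following is a minimal, dependency-free sketch of layer-wise attention followed by spatial pyramid pooling. It is an illustration under stated assumptions, not the authors' implementation: the function names `layer_attention` and `spatial_pyramid_pool`, the fixed `layer_scores`, the use of max pooling, and the one-dimensional pooling over token positions are all hypothetical choices made here. In a trained model the layer scores would be learned parameters, and the released code may organize these steps differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_attention(layer_outputs, layer_scores):
    """Fuse per-layer outputs into one feature map.

    layer_outputs: list of layers, each a seq_len x dim list of lists.
    layer_scores:  one score per layer (assumed learned in a real model).
    Returns a seq_len x dim map: the softmax-weighted sum over layers.
    """
    weights = softmax(layer_scores)
    seq_len = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    fused = [[0.0] * dim for _ in range(seq_len)]
    for w, layer in zip(weights, layer_outputs):
        for t in range(seq_len):
            for d in range(dim):
                fused[t][d] += w * layer[t][d]
    return fused


def spatial_pyramid_pool(features, levels=(1, 2, 4)):
    """Max-pool a seq_len x dim map into a fixed-length vector.

    At each pyramid level the sequence is split into `bins` windows and
    each window is max-pooled per feature dimension. A single bin captures
    global context; finer bins keep local detail. The output length is
    dim * sum(levels), independent of seq_len.
    """
    seq_len = len(features)
    dim = len(features[0])
    pooled = []
    for bins in levels:
        for b in range(bins):
            start = (b * seq_len) // bins
            # Ceiling division for the window end; keep at least one row.
            end = max(start + 1, ((b + 1) * seq_len + bins - 1) // bins)
            end = min(end, seq_len)
            window = features[start:end]
            for d in range(dim):
                pooled.append(max(row[d] for row in window))
    return pooled


# Toy example: two "layers", three token positions, two feature dimensions.
layer_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer_b = [[3.0, 2.0], [1.0, 0.0], [5.0, 8.0]]
fused = layer_attention([layer_a, layer_b], layer_scores=[0.0, 0.0])
vector = spatial_pyramid_pool(fused, levels=(1, 2))
```

Because the pooled vector has the same length for any input length, URLs of different lengths can feed the same classifier head, which is one common motivation for pyramid pooling.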
