PMANet: Malicious URL detection via post-trained language model guided multi-level feature attention network

Ruitong Liu; Yanbin Wang; Haitao Xu; Zhan Qin; Fan Zhang; Yiwei Liu; Zheng Cao

TL;DR

PMANet tackles malicious URL detection by adapting a character-aware pre-trained language model to the URL domain through unsupervised post-training. It introduces a multi-level feature attention framework that combines multi-order feature extraction, layer-aware attention, and spatial pyramid pooling to fuse local and global cues across subword and character representations. The authors report state-of-the-art results on balanced and imbalanced datasets, multi-class classification, cross-dataset tests, and adversarial attacks, including an AUC of 0.9941 and correct detection of all 20 malicious URLs in a case study. The code and data are publicly available.

Abstract

The proliferation of malicious URLs has made their detection crucial for enhancing network security. While pre-trained language models offer promise, existing methods struggle with domain-specific adaptability, character-level information, and the integration of local and global encodings. To address these challenges, we propose PMANet, a pre-trained language model-guided multi-level feature attention network. PMANet employs a post-training process with three self-supervised objectives (masked language modeling, noisy language modeling, and domain discrimination) to capture subword-level and character-level information. It also includes a hierarchical representation module and a dynamic layer-wise attention mechanism for extracting features from low to high levels. Additionally, spatial pyramid pooling integrates local and global features. Experiments in diverse scenarios, including small-scale data, class imbalance, and adversarial attacks, demonstrate PMANet's superiority over state-of-the-art models: it achieves a 0.9941 AUC and correctly detects all 20 malicious URLs in a case study. Code and data are available at https://github.com/Alixyvtte/Malicious-URL-Detection-PMANet.
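To make the last two stages named in the abstract more concrete, here is a minimal pure-Python sketch of layer-wise attention followed by spatial pyramid pooling. It is an illustration under stated assumptions, not the authors' implementation: the softmax-weighted sum over layers, the use of max pooling, the pyramid levels `(1, 2, 4)`, and the function names `fuse_layers` and `pyramid_pool` are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_outputs, layer_scores):
    """Assumed form of layer-wise attention: a softmax-weighted sum of layers.

    layer_outputs[l][t][d] is feature d at position t of layer l.
    layer_scores holds one (learned, here hand-set) score per layer.
    """
    weights = softmax(layer_scores)
    seq_len = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    fused = [[0.0] * dim for _ in range(seq_len)]
    for w, layer in zip(weights, layer_outputs):
        for t in range(seq_len):
            for d in range(dim):
                fused[t][d] += w * layer[t][d]
    return fused


def pyramid_pool(sequence, levels=(1, 2, 4)):
    """Max-pool the sequence into 1, 2, and 4 bins and concatenate the results.

    The single-bin level summarizes global context; finer levels keep
    coarse positional (local) information. Output length is
    dim * sum(levels), independent of the input sequence length.
    """
    seq_len = len(sequence)
    dim = len(sequence[0])
    pooled = []
    for bins in levels:
        for b in range(bins):
            start = (b * seq_len) // bins
            # Guarantee a non-empty window even when seq_len < bins.
            end = max(start + 1, ((b + 1) * seq_len) // bins)
            window = sequence[start:end]
            pooled.extend(max(vec[d] for vec in window) for d in range(dim))
    return pooled


# Two toy "layers", each with 4 positions and 2 features per position.
layer_a = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
layer_b = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0]]
fused = fuse_layers([layer_a, layer_b], [0.0, 0.0])  # equal weights
features = pyramid_pool(fused)
```

Because the pooled vector has a fixed length regardless of the URL length, it can feed a fixed-size classifier head, which is the usual reason for choosing pyramid pooling over a single global pool.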

Paper Structure (18 sections, 16 equations, 8 figures, 6 tables)

Introduction
Related Work
Conventional DL-based Methods
Pre-trained LMs-based Methods
Malicious URL Data
Method of PMANet
Network Structure of PMANet
Post-training Method
Experiments
Evaluation of Multi-Layer Feature
Comparison with Baseline Methods
Binary classification
Multiple classification
Cross-Dataset Testing
Adversarial Evaluation
...and 3 more sections

Figures (8)

Figure 1: Histogram of URL length.
Figure 2: The overall workflow of PMANet. PMANet employs post-trained CharBERT on URLs to extract features at both the character and subword levels. Multi-order feature extraction modules then derive encoded representations ranging from low to high-level. Subsequently, layer-aware attention dynamically readjusts the weighting of different features, which is followed by spatial pyramid pooling that accentuates local nuances and consolidates global context.
Figure 3: The architecture diagram of BiGRU.
Figure 4: The architecture diagram of Heterogeneous Interaction.
Figure 5: The diagram of three unsupervised post-training tasks. Masked LM and Noisy LM are employed for learning the contextual semantics of URL subwords and character-level representations, respectively, while the domain discrimination task is responsible for domain adaptability learning.
...and 3 more figures
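
The Figure 5 caption describes two input-corruption tasks, Masked LM over subwords and Noisy LM over characters. The sketch below shows how such training inputs could be produced for a URL. It is only an illustration: the regex tokenizer stands in for the model's real subword tokenizer, the drop/add/swap character edits are an assumed noise scheme, and the names `split_url`, `mask_tokens`, and `add_char_noise` are hypothetical. The domain discrimination task is not covered here.

```python
import random
import re

MASK = "[MASK]"


def split_url(url):
    """Crude stand-in tokenizer: alphanumeric runs and single delimiters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def mask_tokens(tokens, rate, rng):
    """Masked-LM style corruption.

    Returns (corrupted, targets); targets[i] is the original token where a
    mask was applied and None elsewhere.
    """
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            corrupted.append(MASK)
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets


def add_char_noise(token, rng):
    """Noisy-LM style corruption: one random character drop, add, or swap."""
    if len(token) < 2:
        return token  # too short to perturb meaningfully
    op = rng.choice(["drop", "add", "swap"])
    i = rng.randrange(len(token) - 1)
    if op == "drop":
        return token[:i] + token[i + 1:]
    if op == "add":
        return token[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + token[i:]
    # swap the characters at positions i and i + 1
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]


rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = split_url("http://a.com/x")
masked, targets = mask_tokens(tokens, 0.15, rng)
noisy = [add_char_noise(tok, rng) for tok in tokens]
```

In a training loop, the model would be asked to recover `targets` from `masked` and the original tokens from `noisy`; the 0.15 masking rate used here is a conventional placeholder, not a value taken from the paper.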
