Real-PGDN: A Two-level Classification Method for Full-Process Recognition of Newly Registered Pornographic and Gambling Domain Names
Hao Wang, Yingshuo Wang, Junang Gan, Yanan Cheng, Jinshuai Zhang
TL;DR
The paper addresses the challenge of accurately detecting newly registered pornographic and gambling domain names (PGDN) under real-world conditions in which features are often missing. It introduces Real-PGDN, a two-level classifier that combines CoSENT (text embeddings), an MLP, and a Random Forest, and evaluates it on NRD2024, a real-world dataset that tracks 1.5 million newly registered domains over 20 days. Key contributions include the NRD2024 dataset, a 20-feature, six-record-type feature set, feature analysis and augmentation strategies, and an evaluation showing 0.9746 accuracy and 0.9788 precision. A case study also reports a forecast precision of over 70% for PGDN whose use is delayed after registration. The approach shows strong potential for timely, robust PGDN recognition in operational environments and provides a public dataset for future research.
Abstract
Online pornography and gambling have long posed regulatory challenges for governments and threaten both personal assets and privacy. It is therefore important to study the classification of newly registered Pornographic and Gambling Domain Names (PGDN). However, scholarly investigation into this topic is limited. Some previous PGDN classification efforts pursue high accuracy on idealized sample data, while others use up-to-date data from real-world scenarios but achieve lower classification accuracy. This paper introduces the Real-PGDN method, which covers the complete process: timely and comprehensive crawling of real data, feature extraction that tolerates missing features, precise PGDN classification, and assessment of application effects in actual scenarios. Our two-level classifier, which integrates CoSENT (BERT-based), a Multilayer Perceptron (MLP), and traditional classification algorithms, achieves 97.88% precision. The research also produces the NRD2024 dataset, which contains 20 days of continuous detection information for 1,500,000 newly registered domain names across 6 directions. Results from our case study show that the method also maintains a forecast precision of over 70% for PGDN whose use is delayed after registration.
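
The abstract does not specify how missing features are handled or how the two levels are connected. As an illustration only, the following minimal Python sketch shows one common way to do both: each numeric feature is encoded as a (value, present-flag) pair so that a downstream model can distinguish "absent" from "zero", and a first-level text score is prepended to form the second-level input. The feature names (`ttl`, `mx_count`, `ns_count`), the placeholder value, and the stacking layout are all assumptions made for this sketch, not details taken from the paper.

```python
from typing import List, Mapping, Optional, Sequence

# Placeholder stored when a feature is absent (an assumption for this sketch).
MISSING_VALUE = 0.0


def encode_with_missing_flags(
    record: Mapping[str, Optional[float]],
    feature_names: Sequence[str],
) -> List[float]:
    """Encode each feature as [value, present_flag].

    A feature that is absent from the record, or explicitly None,
    becomes [MISSING_VALUE, 0.0]; otherwise it becomes [value, 1.0].
    """
    vector: List[float] = []
    for name in feature_names:
        value = record.get(name)
        if value is None:
            vector.extend([MISSING_VALUE, 0.0])
        else:
            vector.extend([float(value), 1.0])
    return vector


def second_level_input(
    text_score: float,
    record: Mapping[str, Optional[float]],
    feature_names: Sequence[str],
) -> List[float]:
    """Prepend a first-level text score to the flag-encoded features.

    This stacking layout is hypothetical; the paper may combine its
    levels differently.
    """
    return [text_score] + encode_with_missing_flags(record, feature_names)


# Hypothetical record: one feature observed, one None, one never collected.
names = ["ttl", "mx_count", "ns_count"]
row = second_level_input(0.9, {"ttl": 300.0, "mx_count": None}, names)
```

In this layout the vector length is fixed at `1 + 2 * len(feature_names)` regardless of how many features were actually collected, which is what lets a fixed-input classifier accept incomplete records.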
