Real-PGDN: A Two-level Classification Method for Full-Process Recognition of Newly Registered Pornographic and Gambling Domain Names

Hao Wang, Yingshuo Wang, Junang Gan, Yanan Cheng, Jinshuai Zhang

TL;DR

The paper addresses the detection of newly registered pornographic and gambling domain names (PGDN) under real-world conditions in which features are often missing. It introduces Real-PGDN, a two-level classifier that combines CoSENT text embeddings, an MLP, and a Random Forest, evaluated on NRD2024, a real-world dataset that tracks 1.5 million newly registered domains over 20 days. Key contributions include the NRD2024 dataset, a 20-feature set spanning six record types, feature analysis and augmentation strategies, and an evaluation reporting 0.9746 accuracy and 0.9788 precision. A case study further reports a forecast precision of over 70% for PGDN that are put into use only some time after registration. The approach shows strong potential for timely, robust PGDN recognition in operational environments and provides a public dataset for future research.

Abstract

Online pornography and gambling have consistently posed regulatory challenges for governments, threatening both personal assets and privacy. Therefore, it is imperative to research the classification of newly registered Pornographic and Gambling Domain Names (PGDN). However, scholarly investigation into this topic is limited. Some previous efforts in PGDN classification pursue high accuracy using ideal sample data, while others employ up-to-date data from real-world scenarios but achieve lower classification accuracy. This paper introduces the Real-PGDN method, which covers the complete process: timely and comprehensive real-data crawling, feature extraction with feature-missing tolerance, precise PGDN classification, and assessment of application effects in actual scenarios. Our two-level classifier, which integrates CoSENT (BERT-based), Multilayer Perceptron (MLP), and traditional classification algorithms, achieves 97.88% precision. The research process amasses the NRD2024 dataset, which contains continuous detection information over 20 days for 1,500,000 newly registered domain names across 6 directions. Results from our case study demonstrate that this method also maintains a forecast precision of over 70% for PGDN that are delayed in usage after registration.
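
The reported figures distinguish accuracy from precision, and the two can differ considerably when the positive class is rare. As a minimal sketch of the standard definitions (not the authors' evaluation code), the snippet below computes both from confusion-matrix counts. The `ConfusionCounts` type and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts; PGDN is treated as the positive class."""
    tp: int  # PGDN correctly flagged as PGDN
    fp: int  # benign domains wrongly flagged as PGDN
    tn: int  # benign domains correctly left unflagged
    fn: int  # PGDN that were missed


def accuracy(c: ConfusionCounts) -> float:
    """Share of all domains that were classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Share of flagged domains that really are PGDN."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


# Hypothetical counts for illustration only (not results from the paper).
counts = ConfusionCounts(tp=90, fp=10, tn=880, fn=20)
print(f"accuracy={accuracy(counts):.4f} precision={precision(counts):.4f}")
```

Precision is the more operationally relevant number when every flagged domain triggers a follow-up action, because it measures how many of those flags are correct.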

Paper Structure

This paper contains 22 sections, 9 equations, 16 figures, 7 tables, 1 algorithm.

Figures (16)

  • Figure 1: Flowchart of the Real-PGDN method.
  • Figure 2: Architecture of the distributed detection system.
  • Figure 3: Detection timeline.
  • Figure 4: Attempts to extract web page content features.
  • Figure 5: An example of the IP URL redirection pages.
  • ...and 11 more figures