Understanding IoT Domain Names: Analysis and Classification Using Machine Learning

Ibrahim Ayoub, Martine S. Lenders, Benoît Ampeau, Sandoche Balakrichenan, Kinda Khawam, Thomas C. Schmidt, Matthias Wählisch

TL;DR

The study asks whether IoT devices performing machine-to-machine (M2M) communication contact a distinct set of domains, and whether these domains can be reliably distinguished from those contacted by other devices. It constructs two lists, IoT M2M Names and Other Names, from 12 public DNS-trace datasets and two top lists (Cisco and Tranco), then converts domain-name tokens into real-valued vectors via Word2vec with CBOW, padding each name to length $40$ with vector dimension $32$. Six classifiers, led by Random Forest, achieve high accuracy (near $99\%$ in several setups) and robust cross-validation results, while an ablation study identifies the second-level domain as the most informative feature. The work provides actionable insights for protocol design and network security, notes the limitations of its data sources, and points to future, broader data collection that includes malicious domain names and names produced by domain generation algorithms (DGAs). Overall, the approach demonstrates strong discriminative power for IoT M2M domain names and informs monitoring and defense strategies in IoT networks.

Abstract

In this paper, we investigate the domain names of servers on the Internet that are accessed by IoT devices performing machine-to-machine communications. Using machine learning, we classify between them and domain names of servers contacted by other types of devices. By surveying past studies that used testbeds with real-world devices and using lists of top visited websites, we construct lists of domain names of both types of servers. We study the statistical properties of the domain name lists and train six machine learning models to perform the classification. The word embedding technique we use to get the real-value representation of the domain names is Word2vec. Among the models we train, Random Forest achieves the highest performance in classifying the domain names, yielding the highest accuracy, precision, recall, and F1 score. Our work offers novel insights to IoT, potentially informing protocol design and aiding in network security and performance monitoring.

Paper Structure (27 sections, 1 equation, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Our method applied in this paper.
  • Figure 2: Violin plots for name properties found for each domain name in our datasets.
  • Figure 3: Top 50 labels in the datasets. The top 4 of those are shown separately, while the remaining 46 are summarized to "others".
  • Figure 4: Word embedding: After prepending ' *' to each domain name until it has 40 labels, Word2vec is used to generate a real-valued vector representation of $32 \times 40$ real numbers of each domain name.
  • Figure 5: Accuracy, precision, recall, and $F_{1}$ score of each ML model for the top 1415 domain names from Other Names, Cisco, and Tranco, plus a uniformly sampled Mix of 1415 domain names from the three lists, each vs. the 1415 domain names from IoT M2M Names.
  • ...and 4 more figures
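
The padding step described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `pad_labels` and `embed`, the lowercasing, the zero vector for unknown labels, and the toy lookup table are all assumptions made here. The toy table stands in for a trained Word2vec model, which is not reproduced.

```python
from typing import Dict, List

PAD_LABEL = "*"   # padding token named in the Figure 4 caption
MAX_LABELS = 40   # padded length from the caption
EMBED_DIM = 32    # per-label vector dimension from the caption


def pad_labels(domain: str, max_labels: int = MAX_LABELS) -> List[str]:
    """Split a domain name into labels and prepend '*' until it has max_labels."""
    # Assumption: names are lowercased and a trailing root dot is ignored.
    labels = [label for label in domain.lower().rstrip(".").split(".") if label]
    if len(labels) > max_labels:
        raise ValueError(f"{domain!r} has more than {max_labels} labels")
    return [PAD_LABEL] * (max_labels - len(labels)) + labels


def embed(labels: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Concatenate one vector per label into a single flat vector.

    `table` is a hypothetical stand-in for a trained Word2vec lookup.
    """
    zero = [0.0] * EMBED_DIM  # assumption: unknown labels map to a zero vector
    flat: List[float] = []
    for label in labels:
        flat.extend(table.get(label, zero))
    return flat


padded = pad_labels("device.example.com")
toy_table = {"com": [0.1] * EMBED_DIM, "example": [0.2] * EMBED_DIM}
vector = embed(padded, toy_table)
```

Here `padded` holds 37 padding labels followed by `device`, `example`, `com`, and `vector` has $32 \times 40 = 1280$ entries, matching the representation size given in the caption.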