IoT Device Labeling Using Large Language Models

Bar Meyuhas, Anat Bremler-Barr, Tal Shapira

TL;DR

This work tackles a key challenge in IoT labeling: how can an AI solution label an IoT device that has never been seen before and whose label is unknown?

Abstract

The IoT market is diverse and characterized by a multitude of vendors that support different device functions (e.g., speaker, camera, vacuum cleaner). Within this market, IoT security and observability systems use real-time identification techniques to manage these devices effectively. Most existing IoT identification solutions employ machine learning techniques that assume the IoT device, labeled by both its vendor and function, was observed during their training phase. We tackle a key challenge in IoT labeling: how can an AI solution label an IoT device that has never been seen before and whose label is unknown? Our solution extracts textual features such as domain names and hostnames from network traffic, and then enriches these features using Google search data alongside catalogs of vendors and device functions. The solution also integrates an auto-update mechanism that uses Large Language Models (LLMs) to update these catalogs with emerging device types. Based on the information gathered, the device's vendor is identified through string matching against the enriched features. The function is then deduced by LLMs and zero-shot classification over a predefined catalog of IoT functions. In an evaluation of our solution on 97 unique IoT devices, our function labeling approach achieved HIT1 and HIT2 scores of 0.7 and 0.77, respectively. As far as we know, this is the first research to tackle AI-automated IoT labeling.
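
The vendor step and the HIT scores described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sample catalog, and the sample enriched texts are hypothetical, and HIT-k is assumed here to mean the fraction of devices whose true label appears among the top k ranked predictions.

```python
from collections import Counter


def label_vendor(enriched_texts, vendor_catalog):
    """Rank catalog vendors by how often their name appears in the enriched texts.

    Assumption: matching is a case-insensitive substring count, and the count
    doubles as a confidence score.
    """
    counts = Counter()
    for text in enriched_texts:
        lowered = text.lower()
        for vendor in vendor_catalog:
            counts[vendor] += lowered.count(vendor.lower())
    # Drop vendors that never matched; most_common() sorts by descending count.
    return [(vendor, n) for vendor, n in counts.most_common() if n > 0]


def hits_at_k(ranked_predictions, true_labels, k):
    """Fraction of devices whose true label is among the top-k predictions."""
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, true_labels)
        if truth in preds[:k]
    )
    return hits / len(true_labels)


# Hypothetical enriched features for a single device.
enriched = [
    "Samsung SmartThings Hub - smart home controller",
    "api.smartthings.com Samsung Electronics",
]
catalog = ["Samsung", "Amazon", "Google"]
vendor_ranking = label_vendor(enriched, catalog)

# Hypothetical ranked function predictions for three devices.
predictions = [["camera", "speaker"], ["hub", "plug"], ["plug", "bulb"]]
truths = ["camera", "plug", "vacuum"]
hit1 = hits_at_k(predictions, truths, k=1)
hit2 = hits_at_k(predictions, truths, k=2)
```

In this toy data only "Samsung" matches, so it is the sole entry in `vendor_ranking`, and HIT2 is at least as large as HIT1 because the top-2 list always contains the top-1 prediction.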

Paper Structure (9 sections, 6 figures, 5 tables, 2 algorithms)

Figures (6)

  • Figure 1: Example of Features for the SmartThing Hub: First, we present the features derived from the traffic, followed by a sample of the enriched features (matching colors link each feature to its enriched feature). Words relevant to the vendor label decision are highlighted in bold, and those relevant to function decisions are underlined.
  • Figure 2: A schematic illustration of our IoT labeling solution. First, features are extracted and then enriched. Second, the vendor and function models label the device. The system's output is a label, a confidence score, and a justification for each device.
  • Figure 3: CDF of the number of results returned per feature value
  • Figure 4: Comparative analysis of labeling accuracy per feature (filled bars) and feature availability (hollow bars, the percentage of the dataset in which the feature exists). One subfigure presents the accuracy for the device function, while the other presents the accuracy for the vendor.
  • Figure 5: Confidence score per device of our vendor labeling algorithm (equal to the number of label matches in the enriched data), ordered by the first matching score. The highest score is marked with a circle and the second highest with a triangle.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 1
  • Definition 2