STRisk: A Socio-Technical Approach to Assess Hacking Breaches Risk

Hicham Hammouchi, Narjisse Nejjari, Ghita Mezzour, Mounir Ghogho, Houda Benbrahim

TL;DR

STRisk is a predictive system that extends hacking-breach prediction with a social media dimension. Open ports and expired certificates are the best technical predictors, while spreadability and agreeability are the best social predictors.

Abstract

Data breaches have begun to take on new dimensions, and predicting them is becoming increasingly important to organizations. Prior work has addressed this issue mainly from a technical perspective and has neglected other contributing aspects such as the social media dimension. To fill this gap, we propose STRisk, a predictive system that expands the scope of the prediction task by bringing the social media dimension into play. We study over 3800 US organizations, including both victim and non-victim organizations. For each organization, we design a profile composed of a variety of externally measured technical indicators and social factors. In addition, to account for unreported incidents, we treat the non-victim sample as noisy and propose a noise correction approach to correct mislabeled organizations. We then build several machine learning models to predict whether an organization is at risk of experiencing a hacking breach. By exploiting both technical and social features, we achieve an Area Under the Curve (AUC) score exceeding 98%, which is 12% higher than the AUC achieved using only technical features. Furthermore, our feature importance analysis reveals that open ports and expired certificates are the best technical predictors, while spreadability and agreeability are the best social predictors.
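Since the abstract reports its results as AUC scores, a minimal sketch of how AUC can be computed may help when reading the comparison between feature sets. It uses the pairwise (Mann-Whitney) formulation: the fraction of (victim, non-victim) pairs in which the victim receives the higher score. The function name `auc_score` and all scores below are illustrative assumptions, not the paper's code or data.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (invented numbers): 1 = victim, 0 = non-victim.
labels = [1, 1, 1, 0, 0, 0]
technical_only = [0.9, 0.4, 0.7, 0.5, 0.3, 0.2]         # hypothetical model scores
technical_plus_social = [0.9, 0.8, 0.7, 0.5, 0.3, 0.2]  # hypothetical model scores

auc_tech = auc_score(labels, technical_only)
auc_all = auc_score(labels, technical_plus_social)
print(f"technical only: {auc_tech:.3f}, technical + social: {auc_all:.3f}")
```

In this toy data the second score list separates the two classes perfectly, so its AUC is 1.0, while the first misranks one victim below one non-victim. The quadratic pairwise loop is only meant for readability; a rank-based implementation would be preferable on thousands of organizations.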

Paper Structure

This paper contains 38 sections, 8 equations, 5 figures, 17 tables.

Figures (5)

  • Figure 1: STRisk pipeline, which combines technical misconfigurations and Twitter social signals for both victim and non-victim organizations, corrects noisy labels, and builds the predictive models that discriminate risky organizations from non-risky ones
  • Figure 2: Boxplot of predicted probabilities for the examples that were selected to be flipped
  • Figure 3: Example of a stacking model using CatBoost and Bagging classifiers, with Logistic Regression as the meta-model
  • Figure 4: Per-class feature interpretability on the test set using SHAP values and the CatBoost model
  • Figure 5: Separate predictions using each category of features alone vs. overall predictions using the full feature set, which covers technical, social (Twitter), sector, and organization size features
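
The abstract and the Figure 2 caption indicate that some non-victim organizations are relabeled based on predicted probabilities, but this section does not spell out the selection rule. The following is therefore only a generic sketch of confidence-based label flipping, not the authors' procedure: the function name `flip_noisy_negatives`, the `threshold` value, and the sample numbers are all assumptions made for illustration.

```python
def flip_noisy_negatives(labels, probs, threshold=0.9):
    """Relabel non-victims (0) as victims (1) when the predicted breach
    probability is at least `threshold`.

    Returns the corrected label list and the indices that were flipped.
    The threshold is a hypothetical choice, not a value from the paper.
    """
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    corrected = list(labels)
    flipped = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        if y == 0 and p >= threshold:
            corrected[i] = 1
            flipped.append(i)
    return corrected, flipped


# Toy example (invented numbers): probabilities would come from a model
# trained on the noisy labels, ideally via out-of-fold predictions.
noisy_labels = [1, 0, 0, 0, 1, 0]
predicted_probs = [0.97, 0.95, 0.10, 0.40, 0.85, 0.92]

clean_labels, flipped_idx = flip_noisy_negatives(noisy_labels, predicted_probs)
print(clean_labels, flipped_idx)
```

Only the nominal non-victims are candidates for flipping here, which mirrors the abstract's assumption that unreported incidents make the non-victim sample the noisy one.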