Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies

Mu-Huan Miles Chung, Sharon Li, Jaturong Kongmanee, Lu Wang, Yuhong Yang, Calvin Giang, Khilan Jerath, Abhay Raman, David Lie, Mark Chignell

TL;DR

The study tackles privacy-preserving email anomaly detection with Active Learning, where labels come from analysts working on redacted data. It proposes Expert-Derived Information Gain (EDIG), a sampling strategy that blends model uncertainty with expert confidence to maximize information gain in the labeling loop. Through two case studies in an enterprise setting, EDIG improves learning efficiency and early performance, while highlighting the importance of expert screening and confidence calibration; results show benefits are strongest in early AL stages and when labelers are well-calibrated. The work provides actionable guidance for deploying privacy-aware AL in cybersecurity, including when to use EDIG, how to select and train analysts, and how to measure both model and human uncertainty. Overall, confidence-informed EDIG offers a practical pathway to more effective AL under privacy constraints, with implications for faster incident detection and reduced labeling burden.

Abstract

Redacted emails satisfy most privacy requirements, but they make it more difficult to detect anomalous emails that may be indicative of data exfiltration. In this paper we develop an enhanced method of Active Learning using an information gain maximizing heuristic, and we evaluate its effectiveness in a real-world setting where only redacted versions of emails could be labeled by human analysts due to privacy concerns. In the first case study we examined how Active Learning should be carried out. We found that model performance was best when a single analyst who was highly skilled at the labeling task provided the labels. In the second case study we used confidence ratings to estimate the labeling uncertainty of analysts and then prioritized instances for labeling based on the expected information gain (the difference between model uncertainty and analyst uncertainty) that would be provided by labeling each instance. We found that the information gain maximizing heuristic improved model performance over existing sampling methods for Active Learning. Based on the results obtained, we recommend that analysts be screened, and possibly trained, prior to implementation of Active Learning in cybersecurity applications. We also recommend that the information gain maximizing sampling method (based on expert confidence) be used in the early stages of Active Learning, provided that well-calibrated confidence can be obtained. We also note that the expertise of analysts should be assessed prior to Active Learning, as we found that analysts with lower labeling skill were overconfident in their labels, that is, their confidence was poorly calibrated.
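
To make the selection rule in the abstract concrete, the following is a minimal sketch of scoring instances by model uncertainty minus analyst uncertainty. It is not the authors' implementation. The abstract does not say how either uncertainty is computed, so the use of binary Shannon entropy, the function names (`binary_entropy`, `information_gain_score`, `rank_for_labeling`), and the example pool with its estimated analyst confidences are all assumptions made for illustration.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a Bernoulli(p) variable (assumed uncertainty measure)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def information_gain_score(model_prob: float, analyst_conf: float) -> float:
    """Model uncertainty minus analyst uncertainty.

    model_prob:   model's predicted probability that the email is anomalous.
    analyst_conf: estimated probability that the analyst's label is correct
                  (hypothetical input; how it is estimated is not specified here).
    """
    return binary_entropy(model_prob) - binary_entropy(analyst_conf)


def rank_for_labeling(candidates, k):
    """Return the ids of the k candidates with the highest score.

    candidates: iterable of (instance_id, model_prob, analyst_conf) tuples.
    """
    ranked = sorted(
        candidates,
        key=lambda c: information_gain_score(c[1], c[2]),
        reverse=True,
    )
    return [c[0] for c in ranked[:k]]


# Hypothetical pool covering the four certainty/uncertainty combinations.
pool = [
    ("email_a", 0.50, 0.95),  # model unsure, analyst confident
    ("email_b", 0.50, 0.55),  # both unsure
    ("email_c", 0.98, 0.95),  # both confident
    ("email_d", 0.97, 0.60),  # model confident, analyst unsure
]
top = rank_for_labeling(pool, 2)
```

Under these assumptions, the instance where the model is unsure and the analyst is confident receives the highest score, and the instance where the model is confident but the analyst is unsure receives the lowest. The ranking is only as good as the confidence estimates fed into it, which is why the abstract conditions its recommendation on well-calibrated analyst confidence.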

Paper Structure (32 sections, 11 equations, 14 figures, 2 tables)


Figures (14)

  • Figure 1: Possible combinations of human and model certainty and uncertainty (highlighting the “sweet” spot where humans are relatively certain and the model is relatively uncertain)
  • Figure 2: AL implementation in a binary classification task
  • Figure 3: Pre-labeled dataset participant confidence values’ means and error bars for each label class
  • Figure 4: Trend of average 1 (True) label percentages of the first 14 instances in each round
  • Figure 5: Individual differences in terms of average confidence level values and their variations in round 1 (left panel) and in all rounds (right panel)
  • ...and 9 more figures