Anomaly Detection in OKTA Logs using Autoencoders

Jericho Cain, Hayden Beadles, Karthik Venkatesan

TL;DR

This work applies unsupervised anomaly detection to Okta logs using autoencoders. To use an autoencoder properly, the authors first transform and simplify the log data they receive from their users.

Abstract

Okta logs are used today to detect cybersecurity events using various rule-based models with restricted look-back periods. These rule-based models have limitations, such as limited retrospective analysis, a predefined rule set, and susceptibility to generating false positives. To address this, we adopt unsupervised techniques, specifically employing autoencoders. To properly use an autoencoder, we need to transform and simplify the log data we receive from our users. This transformed and filtered data is then fed into the autoencoder, and the output is evaluated.

Paper Structure

This paper contains 21 sections, 14 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: This plot shows the median login counts for actors by day across a sample of logins in Jan 2023. Notice the dips that occur on the weekends and the general, predictable ebb and flow of the counts.
  • Figure 2: This references a website: https://www.movable-type.co.uk/scripts/geohash.html. This site gives a good introduction to geohashing and a good visual introduction to precision and formulas, background on hashing algorithms, etc.
  • Figure 3: This map shows a set of logins, plotted as blue points, for a single actor. This actor is known to be located in Oregon, but that is hard to see because of the noise in the login data.
  • Figure 4: After changing lat / lon locations to geohashes and performing some simple frequency analysis on logins per actor, we choose the login frequencies that are most relevant for an actor and adjust for the population proportion using the Wilson Score Confidence Interval.
  • Figure 5: This plot shows the application login density for a single actor. The x-axis reflects the encoded indices of the applications accessed, and the y-axis reflects the density, or distribution, of those logins. Before estimating the effects of the population, we notice noise, especially in the right tail of the distribution. Our goal is to get a better view of the actual application login distribution per actor in order to catch anomalies, so removing this noise is the priority.
  • ...and 4 more figures
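
The Figure 4 caption describes counting logins per geohash for an actor and adjusting those frequencies with the Wilson Score Confidence Interval. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the sample geohash strings, the 95% z-value, and the `min_lower` cutoff are all assumptions made for illustration; the captions do not say how the interval is thresholded.

```python
import math
from collections import Counter


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if total == 0:
        # No observations: the proportion is completely undetermined.
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def relevant_geohashes(logins, min_lower=0.05, z=1.96):
    """Keep geohashes whose Wilson lower bound is at least min_lower.

    `logins` is a list of geohash strings for one actor. The cutoff
    `min_lower` is a hypothetical parameter, not taken from the paper.
    """
    counts = Counter(logins)
    total = len(logins)
    kept = {}
    for geohash, count in counts.items():
        lower, _ = wilson_interval(count, total, z)
        if lower >= min_lower:
            kept[geohash] = lower
    return kept


# Hypothetical actor: mostly one location, a secondary one, and two one-off logins.
sample_logins = ["9rb"] * 40 + ["c20"] * 8 + ["dr5", "u4p"]
frequent = relevant_geohashes(sample_logins)
```

Using the lower bound rather than the raw frequency penalizes locations seen only a handful of times, which is one way to suppress the kind of one-off noise visible in Figure 3.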