Harnessing TI Feeds for Exploitation Detection
Kajal Patel, Zubair Shafiq, Mateus Nogueira, Daniel Sadoc Menasché, Enrico Lovat, Taimur Kashif, Ashton Woiwood, Matheus Martins
TL;DR
The paper tackles automated detection of vulnerability exploitation in the wild from large, heterogeneous Threat Intelligence (TI) feeds. It introduces a machine learning pipeline that uses static embeddings (Doc2Vec/TI2Vec) and dynamic embeddings (BERT/TIBERT) to convert loosely structured TI text into predictive features, and trains a classifier on those features against expert-labeled ground truth. In longitudinal and combined temporal and spatial evaluations across 191 TI feeds, the TI-based embeddings (especially TIBERT) reach an F1 score of up to about 0.78. This suggests they can support data-driven vulnerability risk assessment and help keep risk scores such as CVSS and EPSS up to date. The work provides a scalable framework for turning TI feed data into actionable exploitation signals that support incident response and proactive patch management.
Abstract
Many organizations rely on Threat Intelligence (TI) feeds to assess the risk associated with security threats. Due to the volume and heterogeneity of the data, manually analyzing the threat information available in different loosely structured TI feeds is prohibitively expensive. Thus, there is a need to develop automated methods to vet and extract actionable information from TI feeds. To this end, we present a machine learning pipeline to automatically detect vulnerability exploitation from TI feeds. We first model threat vocabulary in loosely structured TI feeds using state-of-the-art embedding techniques (Doc2Vec and BERT) and then use the resulting embeddings to train a supervised machine learning classifier to detect exploitation of security vulnerabilities. We use our approach to identify exploitation events in 191 different TI feeds. Our longitudinal evaluation shows that the approach accurately identifies exploitation events from TI feeds when trained using only past data, and even on TI feeds withheld from training. Our proposed approach is useful for a variety of downstream tasks such as data-driven vulnerability risk assessment.
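The evaluation setup described in the abstract (training only on past data, then testing on later events and on feeds withheld from training) can be illustrated with a minimal sketch. This is not the authors' implementation: the TIEvent record, its field names, and the temporal_spatial_split helper are hypothetical names introduced here for illustration, and the embedding and classifier stages are left out entirely. The F1 helper follows the standard definition, the harmonic mean of precision and recall.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TIEvent:
    """One labeled TI feed event. All field names are illustrative assumptions."""
    feed: str        # identifier of the TI feed the event came from
    seen: date       # date the event was observed
    text: str        # loosely structured event description
    exploited: bool  # expert label: does the event report exploitation?


def temporal_spatial_split(events, cutoff, held_out_feeds):
    """Split events so that training uses only past data from non-withheld feeds.

    Returns (train, test_temporal, test_spatial):
      - train: events before `cutoff` from feeds that are not withheld
      - test_temporal: events on or after `cutoff` from feeds that are not withheld
      - test_spatial: events on or after `cutoff` from withheld feeds
    Events before `cutoff` from withheld feeds are discarded.
    """
    held_out = set(held_out_feeds)
    train, test_temporal, test_spatial = [], [], []
    for event in events:
        if event.seen < cutoff:
            if event.feed not in held_out:
                train.append(event)
        elif event.feed in held_out:
            test_spatial.append(event)
        else:
            test_temporal.append(event)
    return train, test_temporal, test_spatial


def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall (0.0 if no true positives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In such a setup, a classifier would be fitted on embeddings of the train events and scored separately on the two test sets, so that the temporal score reflects generalization to the future and the spatial score reflects generalization to unseen feeds.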
