POLAR: Automating Cyber Threat Prioritization through LLM-Powered Assessment
Luoxi Tang, Yuqiao Meng, Ankita Patra, Weicheng Ma, Muchao Ye, Zhaohan Xi
TL;DR
Polar is an end-to-end LLM-powered pipeline for prioritizing a deluge of cyber threat intelligence (CTI): it converts unstructured CTI into structured threat instances with actionable mitigations. It introduces a four-stage workflow (CTI triage, static CVSS scoring, exploitation forecasting, and mitigation recommendation) that grounds reasoning in external knowledge bases while remaining fully automated. The authors identify three intrinsic LLM vulnerabilities in CTI (spurious correlations, contradictory knowledge, and constrained generalization) and present a categorization methodology that combines stratification, autoregressive refinement, and human-in-the-loop supervision. Experiments on real-world CTI datasets show that Polar outperforms baseline LLMs and specialized agents in triage accuracy, static-analysis reliability, exploitation forecasting, and mitigation prioritization. Code is released for reproducibility.
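The four-stage workflow can be pictured as a record that is filled in stage by stage. The following is a minimal sketch of that data flow only, not the authors' implementation: `ThreatInstance`, `run_pipeline`, the field names, and the stub stages are all hypothetical, and the early exit after triage is an assumption about how irrelevant reports might be handled. In a real system each stub would be replaced by an LLM call grounded in an external knowledge base.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ThreatInstance:
    """Hypothetical structured record built up across the four stages."""
    raw_text: str
    relevant: Optional[bool] = None              # set by CTI triage
    cvss_score: Optional[float] = None           # set by static CVSS scoring
    exploit_likelihood: Optional[float] = None   # set by exploitation forecasting
    mitigations: List[str] = field(default_factory=list)  # set by recommendation


Stage = Callable[[ThreatInstance], ThreatInstance]


def run_pipeline(instance: ThreatInstance, triage: Stage,
                 later_stages: List[Stage]) -> ThreatInstance:
    """Run triage first, then the remaining stages in order.

    Assumption: a report that triage marks as not relevant skips the
    later stages entirely.
    """
    instance = triage(instance)
    if not instance.relevant:
        return instance
    for stage in later_stages:
        instance = stage(instance)
    return instance


# Stub stages with placeholder logic and values, for illustration only.
def stub_triage(t: ThreatInstance) -> ThreatInstance:
    t.relevant = "CVE-" in t.raw_text  # keyword check stands in for an LLM
    return t


def stub_score(t: ThreatInstance) -> ThreatInstance:
    t.cvss_score = 0.0  # placeholder, not a real CVSS computation
    return t


def stub_forecast(t: ThreatInstance) -> ThreatInstance:
    t.exploit_likelihood = 0.0  # placeholder, not a real forecast
    return t


def stub_mitigate(t: ThreatInstance) -> ThreatInstance:
    t.mitigations.append("placeholder mitigation")
    return t


LATER_STAGES = [stub_score, stub_forecast, stub_mitigate]

result = run_pipeline(
    ThreatInstance("Advisory mentions CVE-0000-0000"), stub_triage, LATER_STAGES
)
skipped = run_pipeline(ThreatInstance("Unrelated news item"), stub_triage, LATER_STAGES)
```

Passing the stages in as plain callables keeps the orchestration separate from the models behind each stage, so any one stage could be swapped or evaluated in isolation.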
Abstract
Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats, wherein LLMs offer cyber threat intelligence (CTI) to support vulnerability assessment and incident response. While recent work has shown that LLMs can support a wide range of CTI tasks such as threat analysis, vulnerability detection, and intrusion defense, significant performance gaps persist in practical deployments. In this paper, we investigate the intrinsic vulnerabilities of LLMs in CTI, focusing on challenges that arise from the nature of the threat landscape itself rather than the model architecture. Using large-scale evaluations across multiple CTI benchmarks and real-world threat reports, we introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision to reliably analyze failure instances. Through extensive experiments and human inspections, we reveal three fundamental vulnerabilities (spurious correlations, contradictory knowledge, and constrained generalization) that limit how effectively LLMs support CTI. We then provide actionable insights for designing more robust LLM-powered CTI systems to facilitate future research.
