TRACE: Timely Retrieval and Alignment for Cybersecurity Knowledge Graph Construction and Expansion
Zijing Xu, Ziwei Ning, Tiancheng Hu, Jianwei Zhuge, Yangyang Wang, Jiahao Cao, Mingwei Xu
TL;DR
TRACE is a framework that addresses timeliness and coverage gaps in cybersecurity knowledge graphs (CKGs) by unifying 24 structured data sources with 3 categories of unstructured data. It defines a generalized cybersecurity ontology and uses LLMs with retrieval-augmented generation to extract and align entities, enabling continuous, near-real-time expansion of the CKG. The resulting large-scale graph (56 node types, 112 edge types) offers $1.82\times$ the coverage of prior graphs, and entity extraction reaches $86.08\%$ precision, $76.92\%$ recall, and $81.24\%$ F1, which compares favorably with baselines. Entity alignment is also strong, and case studies illustrate practical utility. The work gives threat hunters comprehensive, up-to-date insight into vulnerabilities, attack methods, and defensive technologies, supporting proactive cyber risk management. Future work focuses on reducing isolated nodes, improving prompt design, and incorporating multimodal data.
Abstract
The rapid evolution of cyber threats has exposed significant gaps in security knowledge integration. Cybersecurity Knowledge Graphs (CKGs) that rely on structured data inherently exhibit hysteresis: rapidly evolving unstructured data is rarely incorporated in a timely manner, so insights critical to risk analysis may be omitted. To address these limitations, we introduce TRACE, a framework that integrates knowledge from 24 structured databases and 3 categories of unstructured data, including APT reports, papers, and repair notices. Leveraging Large Language Models (LLMs), TRACE performs efficient entity extraction and alignment, enabling continuous updates to the CKG. Evaluations demonstrate that TRACE achieves a 1.8x increase in node coverage compared to existing CKGs. In entity extraction, TRACE attains a precision of 86.08%, a recall of 76.92%, and an F1 score of 81.24%, surpassing the best-known LLM-based baselines by 7.8%. Furthermore, our entity alignment methods effectively harmonize extracted entities with existing knowledge structures, enhancing the integrity and utility of the CKG. With TRACE, threat hunters and attack analysts gain real-time, holistic insights into vulnerabilities, attack methods, and defense technologies.
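As a quick consistency check on the reported extraction metrics, the sketch below recomputes F1 from the stated precision and recall. It assumes the standard definition of F1 as the harmonic mean of precision and recall, which the abstract does not spell out; the function and variable names are illustrative and not taken from TRACE.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1].

    Assumption: the standard F1 definition; the abstract does not state
    which formula the authors used.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, converted from percentages to fractions.
reported_precision = 0.8608
reported_recall = 0.7692

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1 * 100:.2f}%")  # prints "F1 = 81.24%"
```

Under that assumption, the recomputed value matches the reported F1 of 81.24%, so the three figures are mutually consistent.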
