CTINexus: Automatic Cyber Threat Intelligence Knowledge Graph Construction Using Large Language Models

Yutong Cheng, Osama Bajaber, Saimon Amanuel Tsegai, Dawn Song, Peng Gao

TL;DR

CTINexus addresses the challenge of extracting rich, ontology-driven cyber threat knowledge from unstructured CTI text without large labeled datasets or extensive model tuning. It uses optimized in-context learning with a kNN-based demonstration retriever, a hierarchical entity alignment pipeline, and long-distance relation prediction to build coherent CSKGs from CTI reports. Across 150 real-world CTI reports, CTINexus achieves high triplet extraction, entity grouping/merging, and relation-prediction performance, and demonstrates strong adaptability to different ontologies (e.g., MALOnt and STIX) with efficient inference. The framework promises practical impact for CTI analysis and downstream defenses by providing a scalable, data-efficient means to maintain up-to-date, interconnected threat graphs.

Abstract

Textual descriptions in cyber threat intelligence (CTI) reports, such as security articles and news, are rich sources of knowledge about cyber threats, crucial for organizations to stay informed about the rapidly evolving threat landscape. However, current CTI knowledge extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. Syntax parsing relies on fixed rules and dictionaries, while model fine-tuning requires large annotated datasets, making both paradigms challenging to adapt to new threats and ontologies. To bridge the gap, we propose CTINexus, a novel framework leveraging optimized in-context learning (ICL) of large language models (LLMs) for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction. Unlike existing methods, CTINexus requires neither extensive data nor parameter tuning and can adapt to various ontologies with minimal annotated examples. This is achieved through: (1) a carefully designed automatic prompt construction strategy with optimal demonstration retrieval for extracting a wide range of cybersecurity entities and relations; (2) a hierarchical entity alignment technique that canonicalizes the extracted knowledge and removes redundancy; (3) a long-distance relation prediction technique to further complete the CSKG with missing links. Our extensive evaluations using 150 real-world CTI reports collected from 10 platforms demonstrate that CTINexus significantly outperforms existing methods in constructing accurate and complete CSKGs, highlighting its potential to transform CTI analysis with an efficient and adaptable solution for the dynamic threat landscape.
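
To make item (1) more concrete, the following minimal Python sketch shows how kNN demonstration retrieval might feed a single ICL prompt. The bag-of-words embed function, the prompt wording, the toy demonstration pool, and all function names are assumptions made for illustration. They are not taken from the CTINexus implementation, which would presumably rely on a real text embedding model and its own prompt template.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_demonstrations(query, pool, k=2):
    """Return the k annotated examples most similar to the query report."""
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]

def build_prompt(ontology, demos, query):
    """Combine task description, demonstrations, and query into one instruction."""
    parts = ["Extract (subject, relation, object) triplets. "
             "Allowed entity types: " + ", ".join(ontology) + "."]
    for d in demos:
        parts.append(f"Report: {d['text']}\nTriplets: {d['triplets']}")
    parts.append(f"Report: {query}\nTriplets:")
    return "\n\n".join(parts)

# Hypothetical annotated demonstration pool (toy data, not from the paper).
pool = [
    {"text": "The ransomware encrypts files and demands payment in bitcoin",
     "triplets": [("ransomware", "encrypts", "files")]},
    {"text": "Attackers sent phishing emails with malicious attachments",
     "triplets": [("Attackers", "sent", "phishing emails")]},
    {"text": "The botnet launches denial of service attacks on servers",
     "triplets": [("botnet", "launches", "denial of service attacks")]},
]

query = "A new ransomware strain encrypts files on infected servers"
demos = retrieve_demonstrations(query, pool, k=2)
prompt = build_prompt(["Malware", "Attacker", "Tool"], demos, query)
print(prompt)
```

The demonstrations are placed in descending order of similarity here; the ordering, like the value of k, is a design choice that this sketch does not attempt to match to the paper.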

Paper Structure

This paper contains 44 sections, 5 figures, 10 tables.

Figures (5)

  • Figure 1: CSKGs extracted by EXTRACTOR, TTPDrill, LADDER, and CTINexus for a real-world CTI report. EXTRACTOR, TTPDrill, and LADDER tend to produce incomplete and fragmented subgraphs, lacking comprehensive contextual connections. In contrast, CTINexus constructs a more integrated and comprehensive CSKG, with key information extracted and entities linked, providing a clearer and more complete representation of the threat profile.
  • Figure 2: Overview of CTINexus. CTINexus comprises three phases. Phase 1, Cybersecurity Triplet Extraction, enables end-to-end extraction of cybersecurity triplets using in-context learning with an LLM. Phase 2, Hierarchical Entity Alignment, reduces the redundancy of the CSKG through coarse-grained grouping and fine-grained merging. Phase 3, Long-Distance Relation Prediction, connects disjoint subgraphs by identifying central nodes and performing relation inference.
  • Figure 3: Comparison of CTINexus's ICL-based CTI knowledge extraction (left) and a multi-turn QA-based extraction (right). CTINexus consolidates task descriptions (including applied ontology), $k$ selected demonstrations, and query into a single instruction for efficient cybersecurity triplet extraction. In contrast, the multi-turn QA paradigm requires multiple rounds of conversations with multiple prompts to extract different entities and relations, which is inefficient.
  • Figure 4: The design of CTINexus's hierarchical entity alignment. The coarse-grained entity grouping phase populates an ICL prompt to assign entity types to the extracted triplets according to the applied ontology. Entities with the same type are grouped together. The fine-grained entity merging phase then uses an embedding-based technique to merge semantically similar entities within each group based on a predefined similarity threshold. During this phase, IOC protection is enforced to prevent erroneously merging semantically similar but conceptually distinct IOC entities.
  • Figure 5: The design of CTINexus's long-distance relation prediction. Phase 1 selects central entities (blue) and the topic entity (yellow) from separate subgraphs based on their degree centrality. Phase 2 populates an ICL prompt to infer implicit relations between each central entity and the topic entity.
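
The fine-grained merging step with IOC protection (Figure 4) and the selection of central and topic entities (Figure 5) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the CTINexus implementation: string similarity from difflib stands in for embedding similarity, the 0.8 threshold and the IOC pattern (IPv4 addresses and hex hashes only) are hypothetical, entity types are given directly instead of being assigned by an ICL prompt, and the topic entity is assumed to be the central entity with the highest degree. All entity names are toy data.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical IOC detector: IPv4 addresses and hex hashes only (assumption).
IOC_PATTERN = re.compile(r"^(\d{1,3}(\.\d{1,3}){3}|[0-9a-fA-F]{32,64})$")

def similarity(a, b):
    # String similarity as a stand-in (assumption) for embedding cosine similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def merge_entities(entity_types, threshold=0.8):
    """Map each entity to a canonical name, merging only within a type group."""
    groups = defaultdict(list)
    for name, etype in entity_types.items():
        groups[etype].append(name)
    canonical = {}
    for names in groups.values():
        representatives = []
        for name in names:
            match = None
            if not IOC_PATTERN.match(name):  # IOC protection: IOCs are never merged
                match = next((r for r in representatives
                              if not IOC_PATTERN.match(r)
                              and similarity(name, r) >= threshold), None)
            if match is None:
                representatives.append(name)
                match = name
            canonical[name] = match
    return canonical

def select_central_entities(triplets):
    """Return (topic_entity, other_central_entities) based on degree centrality."""
    neighbors = defaultdict(set)
    for subj, _, obj in triplets:
        neighbors[subj].add(obj)
        neighbors[obj].add(subj)
    seen, centers = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over one disjoint subgraph
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        # Sort first so that ties on degree are broken alphabetically.
        centers.append(max(sorted(component), key=lambda n: len(neighbors[n])))
    topic = max(centers, key=lambda n: len(neighbors[n]))
    return topic, [c for c in centers if c != topic]

entity_types = {
    "FooBear Group": "Attacker", "the FooBear Group": "Attacker",
    "BarLoader": "Malware",
    "192.168.1.10": "Indicator", "192.168.1.11": "Indicator",
}
raw_triplets = [
    ("FooBear Group", "uses", "BarLoader"),
    ("the FooBear Group", "targets", "crypto exchanges"),
    ("FooBear Group", "exploits", "a zero-day vulnerability"),
    ("BarLoader", "connects to", "192.168.1.10"),
    ("dropper", "contacts", "192.168.1.11"),
    ("dropper", "writes", "payload file"),
]

canonical = merge_entities(entity_types)
merged = [(canonical.get(s, s), r, canonical.get(o, o)) for s, r, o in raw_triplets]
topic, others = select_central_entities(merged)
# Each (central entity, topic entity) pair would then be placed in an ICL prompt.
pairs = [(c, topic) for c in others]
print(topic, pairs)
```

In this toy run the two near-identical IP addresses stay separate because of the IOC check, while the two spellings of the attacker name collapse into one node, which also joins two subgraphs that would otherwise be disjoint.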