APT-CGLP: Advanced Persistent Threat Hunting via Contrastive Graph-Language Pre-Training

Xuebo Qiu, Mingqi Lv, Yimei Zhang, Tieming Chen, Tiantian Zhu, Qijie Song, Shouling Ji

TL;DR

APT-CGLP tackles the modality gap between provenance graphs and CTI reports by learning end-to-end cross-modal representations through multi-objective pre-training that combines graph-text contrastive learning with inter-modal masked modeling. A Graph2CTI data synthesis module uses benign provenance graphs and LLM-driven generation to produce training pairs, while a CTI denoising module distills actionable insights from noisy CTI reports to improve real-world applicability. The framework employs a two-stage retrieval pipeline to balance scalability and precision. Across four real-world APT datasets, APT-CGLP achieves state-of-the-art threat hunting performance with high recall, low false positives, and practical retrieval efficiency, demonstrating strong potential for automatic APT hunting and alert validation without manual query engineering.

Abstract

Provenance-based threat hunting identifies Advanced Persistent Threats (APTs) on endpoints by correlating attack patterns described in Cyber Threat Intelligence (CTI) with provenance graphs derived from system audit logs. A fundamental challenge in this paradigm lies in the modality gap: the structural and semantic disconnect between provenance graphs and CTI reports. Prior work addresses this by framing threat hunting as a graph matching task: 1) extracting attack graphs from CTI reports, and 2) aligning them with provenance graphs. However, this pipeline incurs severe information loss during graph extraction and demands intensive manual curation, undermining scalability and effectiveness. In this paper, we present APT-CGLP, a novel cross-modal APT hunting system via Contrastive Graph-Language Pre-training, facilitating end-to-end semantic matching between provenance graphs and CTI reports without human intervention. First, empowered by the Large Language Model (LLM), APT-CGLP mitigates data scarcity by synthesizing high-fidelity provenance graph-CTI report pairs, while simultaneously distilling actionable insights from noisy web-sourced CTIs to improve their operational utility. Second, APT-CGLP incorporates a tailored multi-objective training algorithm that synergizes contrastive learning with inter-modal masked modeling, promoting cross-modal attack semantic alignment at both coarse- and fine-grained levels. Extensive experiments on four real-world APT datasets demonstrate that APT-CGLP consistently outperforms state-of-the-art threat hunting baselines in terms of accuracy and efficiency.
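
To make the coarse-grained alignment objective concrete, the following is a minimal sketch of a symmetric graph-text contrastive (InfoNCE-style) loss over a batch of paired embeddings. It is an illustration under assumptions, not the authors' implementation: the abstract does not give the loss formula, so the function name `graph_text_contrastive_loss`, the cosine-similarity scoring, and the `temperature` value of 0.07 are all hypothetical choices borrowed from common contrastive pre-training practice.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def graph_text_contrastive_loss(graph_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i is (graph_embs[i], text_embs[i]).

    Assumption: embeddings come from some graph encoder and text encoder;
    here they are plain lists of floats. The temperature is a hypothetical
    default, not a value taken from the paper.
    """
    assert graph_embs and len(graph_embs) == len(text_embs)
    g = [l2_normalize(v) for v in graph_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(g)
    # sim[i][j] = cosine(graph i, text j) / temperature
    sim = [
        [sum(a * b for a, b in zip(g[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Graph-to-text: row i should assign the most mass to column i.
    g2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-graph: column j should assign the most mass to row j.
    t2g = -sum(
        log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (g2t + t2g)


aligned = graph_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[1.0, 0.0], [0.0, 1.0]])
swapped = graph_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # matched pairs score a lower loss than swapped pairs
```

The loss is small when each graph embedding is closest to its own report embedding and large when the pairs are swapped. The fine-grained objectives named in the abstract (inter-modal masked modeling) are not covered by this sketch.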

Paper Structure

This paper contains 28 sections, 13 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: A motivating example, where subfigure (a) denotes the denoised CTI report from the DARPA OpTC engagement, and subfigures (b) and (c) represent the query graphs generated by AttacKG (## denotes unknown entities) and the ground truth, respectively.
  • Figure 2: APT-CGLP architecture. (a) Synthetic provenance graph-CTI report pair generation for training; (b) Cross-modal semantic alignment via multi-objective pre-training; (c) CTI report denoising to enhance the usability; and (d) Two-stage retrieval using trained encoders in (b) to balance efficiency and precision in threat hunting.
  • Figure 3: The training framework of APT-CGLP. GTC: graph-text contrastive, GTM: graph-text matching, MGM: masked graph modeling, MLM: masked language modeling, FFN: feed forward network.
  • Figure 4: Results of ablation studies across different datasets.
  • Figure 5: Overhead of APT-CGLP for different $k$.
  • ...and 2 more figures
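
The two-stage retrieval mentioned in the Figure 2(d) caption, and the parameter $k$ in the Figure 5 caption, can be illustrated with a small sketch. The captions do not specify how the stages work, so everything here is an assumption: stage one is taken to be a cheap cosine-similarity top-$k$ filter over precomputed graph embeddings, and stage two a costlier pairwise scorer (for example, something like the GTM head in Figure 3) applied only to the surviving candidates. The names `two_stage_retrieve`, `fine_scorer`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_retrieve(query_emb, graph_embs, fine_scorer, k=2, threshold=0.5):
    """Return (graph_index, score) pairs that survive both stages.

    Assumptions: query_emb is the embedding of a CTI report, graph_embs are
    embeddings of candidate provenance subgraphs, and fine_scorer(index) is a
    hypothetical, more expensive matcher that returns a score in [0, 1].
    """
    # Stage 1: rank all graphs by embedding similarity and keep the top k.
    ranked = sorted(
        range(len(graph_embs)),
        key=lambda i: cosine(query_emb, graph_embs[i]),
        reverse=True,
    )
    candidates = ranked[:k]
    # Stage 2: run the costly scorer only on the k candidates.
    scored = [(i, fine_scorer(i)) for i in candidates]
    hits = [(i, s) for i, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: graph 2 is orthogonal to the query, so stage 1 drops it
# and the fine scorer is never called on it.
graph_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
fine_scores = {0: 0.4, 1: 0.9, 2: 0.99}
result = two_stage_retrieve([1.0, 0.0], graph_embs, lambda i: fine_scores[i])
print(result)
```

Under this reading, a larger `k` gives the second stage more chances to recover a true match at the cost of more expensive scorer calls, which is the kind of trade-off an overhead-versus-$k$ plot would show.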