Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao

TL;DR

This work introduces Primus, an open-source collection of cybersecurity-focused datasets for LLM training that spans pretraining, instruction fine-tuning, and reasoning distillation. The authors combine Primus-Seed and Primus-FineWeb for continued pretraining, then apply Primus-Instruct for instruction following and Primus-Reasoning for reasoning distillation. They report improvements across multiple cybersecurity benchmarks, including a 15.9% gain in the aggregate score and a 15.8% gain on CISSP. They also show improved calibration (lower ECE), document their data collection, preprocessing, augmentation, and evaluation methodologies, and release the models and datasets under permissive licenses to spur further research. Scaled to larger models and extended with RL approaches, this pipeline could advance domain-specific LLM capabilities in cybersecurity and related critical infrastructure applications.

Abstract

Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, and in particular of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pretraining on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain on the security certification benchmark (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.

Paper Structure

This paper contains 54 sections, 13 figures, 20 tables.

Figures (13)

  • Figure 1: Overview of our training pipeline. Primus-Pretraining, Primus-Instruct, and Primus-Reasoning are the datasets of different training stages.
  • Figure 2: Motivation behind Primus. Statistics of existing cybersecurity language models, where reasoning means training models to reason via distillation or RL.
  • Figure 3: Cumulative token count in FineWeb for texts with a cybersecurity score exceeding various thresholds.
  • Figure 4: Ratio of cybersecurity-related text across different score bins in FineWeb.
  • Figure 5: Comparison of deduplication on FineWeb cybersecurity data filtered at a classifier threshold of 0.9.
  • ...and 8 more figures