Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

Yao-Ching Yu; Tsun-Han Chiang; Cheng-Wei Tsai; Chien-Ming Huang; Wen-Kwang Tsao

TL;DR

This work introduces Primus, a comprehensive, open-source collection of cybersecurity-focused datasets for LLM training, spanning pretraining, instruction fine-tuning, and reasoning distillation. The authors combine Primus-Seed and Primus-FineWeb for continued pretraining, then apply Primus-Instruct for instruction following and Primus-Reasoning for reasoning distillation. They report notable improvements across multiple cybersecurity benchmarks, including a 15.9% gain in the aggregate score from continued pretraining and a 15.8% gain on CISSP from reasoning distillation. They also show improved calibration (lower expected calibration error, ECE), describe their data collection, preprocessing, augmentation, and evaluation methodologies in detail, and release the models and datasets under permissive licenses to spur further research. If scaled to larger models and extended with reinforcement learning approaches, this pipeline could significantly advance domain-specific LLM capabilities in cybersecurity and related critical infrastructure applications.
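As a rough illustration of the calibration metric mentioned above, ECE is commonly computed by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below assumes equal-width bins and a hypothetical function name `expected_calibration_error`; it is not the authors' evaluation code, and their binning choices may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1] (an assumed scheme)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for c, ok in zip(confidences, correct):
        # Map confidence to a bin index; a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        avg_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece
```

A lower value means the model's stated confidence tracks its actual accuracy more closely; a model that answers with 90% confidence but is right only 75% of the time contributes a gap of 0.15 for that bin.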

Abstract

Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. In cybersecurity, however, open-source datasets remain scarce, and high-quality cybersecurity pretraining corpora are especially lacking, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pretraining on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.
