Mining the YARA Ecosystem: From Ad-Hoc Sharing to Data-Driven Threat Intelligence

Dectot--Le Monnier de Gouville Esteban; Mohammad Hamdaqa; Moataz Chouchen

Abstract

YARA has established itself as the de facto standard for "Detection as Code," enabling analysts and DevSecOps practitioners to define signatures for malware identification across the software supply chain. Despite its pervasive use, the open-source YARA ecosystem remains characterized by ad-hoc sharing and opaque quality. Practitioners currently rely on public repositories without empirical evidence regarding the ecosystem's structural characteristics, maintenance and diffusion dynamics, or operational reliability. We conducted a large-scale mixed-method study of 8.4 million rules mined from 1,853 GitHub repositories. Our pipeline integrates repository mining to map supply chain dynamics, static analysis to assess syntactic quality, and dynamic benchmarking against 4,026 malware and 2,000 goodware samples to measure operational effectiveness. We reveal a highly centralized structure where 10 authors drive 80% of rule adoption. The ecosystem functions as a "static supply chain": repositories show a median inactivity of 782 days and a median technical lag of 4.2 years. While static quality scores appear high (mean = 99.4/100), operational benchmarking uncovers significant noise (false positives) and low recall. Furthermore, coverage is heavily biased toward legacy threats (Ransomware), leaving modern initial access vectors (Loaders, Stealers) severely underrepresented. These findings expose a systemic "double penalty": defenders incur high performance overhead for decayed intelligence. We argue that public repositories function as raw data dumps rather than curated feeds, necessitating a paradigm shift from ad-hoc collection to rigorous rule engineering. We release our dataset and pipeline to support future data-driven curation tools.
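As a rough illustration of the dynamic benchmarking step, the sketch below computes recall over a malware corpus and the false-positive rate over a goodware corpus for one rule set. It is a minimal sketch, not the released pipeline: the names `BenchmarkResult` and `score_ruleset` are hypothetical, a sample is assumed to count as detected when at least one rule matches it, and the sample identifiers and corpus sizes are toy values rather than results from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    recall: float               # share of malware samples matched by at least one rule
    false_positive_rate: float  # share of goodware samples matched by at least one rule


def score_ruleset(matched_malware: set[str], matched_goodware: set[str],
                  n_malware: int, n_goodware: int) -> BenchmarkResult:
    """Score a rule set from the IDs of the samples it matched (hypothetical helper)."""
    if n_malware <= 0 or n_goodware <= 0:
        raise ValueError("corpus sizes must be positive")
    return BenchmarkResult(
        recall=len(matched_malware) / n_malware,
        false_positive_rate=len(matched_goodware) / n_goodware,
    )


# Toy data for illustration only; these are not figures from the paper.
toy = score_ruleset({"m1", "m2", "m3"}, {"g1"}, n_malware=10, n_goodware=20)
print(f"recall={toy.recall:.2f}, fpr={toy.false_positive_rate:.2f}")
```

Under these assumptions, a rule set with high static quality can still score poorly here, which is the gap between syntactic and operational quality that the abstract describes.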

Paper Structure (22 sections, 4 equations, 8 figures)

Introduction
Background
Related Works
Methodology
Repository Discovery
Rule Extraction and Validation
Rule Deduplication and Clustering
Rule Propagation and Maintenance Analysis
Ecosystem Structure Analysis
Author Contribution Analysis
Rule Quality Assessment
Threat Coverage Classification and Analysis
Results
RQ1: Ecosystem Dynamics and Maintenance
RQ1.1 Structure: A Redundant and Fragmented Landscape
...and 7 more sections

Figures (8)

Figure 1: Quote from Florian Roth introducing YARA Forge on Medium [roth2023yaraforge]
Figure 2: Fictitious example of a YARA rule
Figure 3: Overview of our empirical study
Figure 4: Pareto curve of author influence: A core group of 10 authors drives 80% of all rule adoption.
Figure 5: Influence Map: Comparing author production volume vs. peak impact reveals distinct archetypes (Specialists vs. Mass Producers).
...and 3 more figures
