Mining the YARA Ecosystem: From Ad-Hoc Sharing to Data-Driven Threat Intelligence

Dectot--Le Monnier de Gouville Esteban, Mohammad Hamdaqa, Moataz Chouchen

Abstract

YARA has established itself as the de facto standard for "Detection as Code," enabling analysts and DevSecOps practitioners to define signatures for malware identification across the software supply chain. Despite its pervasive use, the open-source YARA ecosystem remains characterized by ad-hoc sharing and opaque quality. Practitioners currently rely on public repositories without empirical evidence regarding the ecosystem's structural characteristics, maintenance and diffusion dynamics, or operational reliability. We conducted a large-scale mixed-method study of 8.4 million rules mined from 1,853 GitHub repositories. Our pipeline integrates repository mining to map supply chain dynamics, static analysis to assess syntactic quality, and dynamic benchmarking against 4,026 malware and 2,000 goodware samples to measure operational effectiveness. We reveal a highly centralized structure where 10 authors drive 80% of rule adoption. The ecosystem functions as a "static supply chain": repositories show a median inactivity of 782 days and a median technical lag of 4.2 years. While static quality scores appear high (mean = 99.4/100), operational benchmarking uncovers significant noise (false positives) and low recall. Furthermore, coverage is heavily biased toward legacy threats (Ransomware), leaving modern initial access vectors (Loaders, Stealers) severely underrepresented. These findings expose a systemic "double penalty": defenders incur high performance overhead for decayed intelligence. We argue that public repositories function as raw data dumps rather than curated feeds, necessitating a paradigm shift from ad-hoc collection to rigorous rule engineering. We release our dataset and pipeline to support future data-driven curation tools.
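The operational benchmarking step described above reports recall on malware and noise on goodware. As a minimal sketch of how such metrics could be derived from scan results, assuming a sample counts as flagged when at least one rule matches it: the function name `benchmark_metrics`, the shape of the `matches` mapping, and the sample identifiers are illustrative assumptions, not the authors' pipeline.

```python
from typing import Dict, List, Set, Tuple


def benchmark_metrics(
    matches: Dict[str, List[str]],
    malware: Set[str],
    goodware: Set[str],
) -> Tuple[float, float]:
    """Return (recall, false_positive_rate) for a rule set.

    `matches` maps a sample ID to the names of the rules that fired on it
    (hypothetical format). A sample is treated as flagged if any rule fired.
    """
    if not malware or not goodware:
        raise ValueError("both sample sets must be non-empty")

    flagged = {sample for sample, rules in matches.items() if rules}

    # Recall: share of malware samples flagged by at least one rule.
    recall = len(flagged & malware) / len(malware)
    # False positive rate: share of goodware samples flagged by at least one rule.
    false_positive_rate = len(flagged & goodware) / len(goodware)
    return recall, false_positive_rate


# Toy data only; the study itself uses 4,026 malware and 2,000 goodware samples.
toy_malware = {"m1", "m2", "m3", "m4"}
toy_goodware = {"g1", "g2"}
toy_matches = {"m1": ["rule_a"], "m2": [], "g1": ["rule_b"]}
toy_recall, toy_fpr = benchmark_metrics(toy_matches, toy_malware, toy_goodware)
```

Treating any match as a detection is the simplest aggregation; a per-rule breakdown would need the same counts grouped by rule name instead of by sample.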

Paper Structure (22 sections, 4 equations, 8 figures)

Figures (8)

  • Figure 1: Quote from Florian Roth introducing YARA Forge on Medium (roth2023yaraforge)
  • Figure 2: Fictitious example of a YARA rule
  • Figure 3: Overview of our empirical study
  • Figure 4: Pareto curve of author influence: A core group of 10 authors drives 80% of all rule adoption.
  • Figure 5: Influence Map: Comparing author production volume vs. peak impact reveals distinct archetypes (Specialists vs. Mass Producers).
  • ...and 3 more figures
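
The concentration claim behind Figure 4 (a small core of authors accounting for 80% of rule adoption) can be illustrated with a cumulative-share calculation. The sketch below is a generic Pareto computation under the assumption that adoption is available as one count per author; the function name `authors_to_reach` and the toy counts are hypothetical, and how the paper measures adoption is not specified here.

```python
from typing import Dict


def authors_to_reach(adoption_counts: Dict[str, int], threshold: float = 0.8) -> int:
    """Smallest number of top authors whose cumulative share reaches `threshold`."""
    total = sum(adoption_counts.values())
    if total <= 0:
        raise ValueError("adoption counts must sum to a positive value")

    running = 0
    # Walk authors from most to least adopted, accumulating their share.
    ranked = sorted(adoption_counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running / total >= threshold:
            return rank
    return len(ranked)


# Toy counts for illustration only, not data from the study.
toy_counts = {"author_a": 60, "author_b": 30, "author_c": 5, "author_d": 5}
core_size = authors_to_reach(toy_counts)
```

Plotting the running share against author rank yields the Pareto curve itself; the function only reports where that curve crosses the chosen threshold.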