Unveiling A Hidden Risk: Exposing Educational but Malicious Repositories in GitHub

Md Rayhanul Masud; Michalis Faloutsos

TL;DR

The paper addresses the risk that GitHub repositories labeled as educational may host malicious code (MalEdu). It builds a large-scale dataset by querying GitHub for educational repositories and keeping the $35.2\text{K}$ that have both a description and a readme, then applies a two-query ChatGPT annotation workflow: the first query labels a repository as MalEdu or not, and the second assigns each MalEdu repository to a malware family. It finds $9{,}294$ MalEdu repos ($26\%$) across $14$ malware families, with keylogger the most frequent ($1{,}071$ repos), and reports $85\%$ precision in a manual validation. The results are a wake-up call for platform governance and show the need for deeper analysis of software platforms to mitigate malware hidden in educational content.
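The workflow summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Repo` type, the prompt wording, and the `keyword_stub` function (a stand-in for a real ChatGPT call, so that the sketch runs offline) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Repo:
    name: str
    description: str
    readme: str


def has_metadata(repo: Repo) -> bool:
    """Keep only repos with a non-empty description and readme."""
    return bool(repo.description.strip()) and bool(repo.readme.strip())


def annotate(repo: Repo, ask: Callable[[str], str]) -> Optional[str]:
    """Two-query workflow: (1) malicious or not, (2) malware family.

    Returns the family name for a MalEdu repo, or None for a benign one.
    The prompt texts are assumptions, not the prompts used in the paper.
    """
    text = f"Description: {repo.description}\nReadme: {repo.readme}"
    verdict = ask("Is this repository malicious? Answer yes or no.\n" + text)
    if verdict.strip().lower() != "yes":
        return None
    family = ask("Which malware family does it belong to?\n" + text)
    return family.strip().lower()


def keyword_stub(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, based on simple keywords."""
    body = prompt.lower()
    if prompt.startswith("Is this"):
        return "yes" if "keylogger" in body or "ransomware" in body else "no"
    return "keylogger" if "keylogger" in body else "ransomware"


repos = [
    Repo("a", "Keylogger for educational purposes only", "Logs keystrokes."),
    Repo("b", "Sorting algorithms for class", "Educational demos."),
    Repo("c", "", "Ransomware demo"),  # dropped by the filter: no description
]
labels = {r.name: annotate(r, keyword_stub) for r in repos if has_metadata(r)}
print(labels)  # {'a': 'keylogger', 'b': None}
```

Passing the model call in as the `ask` parameter keeps the filtering and two-stage logic testable without network access; a real run would replace `keyword_stub` with a function that sends the prompt to a chat model and returns its reply.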

Abstract

Are malicious repositories hiding under the educational label in GitHub? Recent studies have identified collections of GitHub repositories that host malware source code, with notable collaboration among their developers. Analyzing GitHub repositories therefore deserves close attention, since the platform's open nature gives easy access to malicious software code and artifacts. Here we leverage the capabilities of ChatGPT in a qualitative study to annotate educational GitHub repositories based on the maliciousness of their metadata. Our contribution is twofold. First, we demonstrate the use of ChatGPT to understand and annotate the content published in software repositories. Second, we provide evidence of a hidden risk in educational repositories, which can create opportunities for threats and malicious use. We carry out a systematic study on a collection of 35.2K GitHub repositories that claim to be created for educational purposes only. First, our study finds an increasing trend in the number of such repositories published every year. Second, 9,294 of them are labeled by ChatGPT as malicious, and further categorization of the malicious ones identifies 14 different malware families, including DDoS, keylogger, and ransomware. Overall, this exploratory study is a wake-up call for the community to better understand and analyze software platforms.
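As a quick sanity check on the reported figures, the shares can be recomputed from the counts. This is only an illustrative sketch: the collection size of 35.2K is a rounded value, so the results are approximate, and the keylogger count of 1,071 is taken from the TL;DR.

```python
total_repos = 35_200   # "35.2K" educational repositories (rounded)
maledu_repos = 9_294   # repositories labeled malicious by ChatGPT
keylogger_repos = 1_071  # most frequent malware family

maledu_share = 100 * maledu_repos / total_repos
keylogger_share = 100 * keylogger_repos / maledu_repos

print(f"MalEdu share of collection: {maledu_share:.1f}%")    # 26.4%
print(f"Keylogger share of MalEdu:  {keylogger_share:.1f}%")  # 11.5%
```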

Paper Structure (7 sections, 4 figures)

Problem Definition
Contribution
Methodology
Results and Evaluation
Future Work
Related Work
Acknowledgment

Figures (4)

Figure 1: The number of educational GitHub repositories increases every year. The trend is similar for MalEdu (educational but malicious) repositories.
Figure 2: (a) Example metadata of a GitHub repository that hosts ransomware source code while claiming to be created for educational purposes only. (b) First, a collection of educational repos is classified by ChatGPT. Then, the identified MalEdu repos are classified into malware families.
Figure 3: Top 10 malware families detected among MalEdu repos.
Figure 4: Normalized confusion matrix for ChatGPT annotations.
