
Automating SBOM Generation with Zero-Shot Semantic Similarity

Devin Pereira, Christopher Molloy, Sudipta Acharya, Steven H. H. Ding

TL;DR

This work frames a static code analysis problem as a semantic similarity task in which a transformer model is trained to relate a product name to its corresponding version strings. The model performs strongly on zero-shot classification, which suggests potential for use in a real-world cybersecurity context.

Abstract

As software ecosystems grow more complex and the emphasis on security and compliance increases, it is becoming increasingly important for manufacturers to inventory the software used on their systems. A Software-Bill-of-Materials (SBOM) is a comprehensive inventory detailing a software application's components and dependencies. Current approaches rely on case-based reasoning and identify the software components embedded in binary files inconsistently. We propose a different route: an automated method for generating SBOMs, intended to help prevent disastrous supply-chain attacks. Staying within static code analysis, we interpret this problem as a semantic similarity task in which a transformer model is trained to relate a product name to its corresponding version strings. Our test results show strong performance on the zero-shot classification task, indicating potential for use in a real-world cybersecurity context.

Paper Structure

This paper contains 12 sections, 3 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Binaries obtained from different sources are passed through extraction scripts. Strings are then generated from the resulting ELF files and passed through a regular-expression-based filtering script. Each final data point is a tuple with four fields.
  • Figure 2: The product name and one version string from the list are taken as a data point and serve as input to the S-BERT model. S-BERT produces embeddings, which are pooled into two vectors, u and v. The cosine similarity of these two vectors is computed and compared with the predefined correlation label to obtain the final classification.
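
The two captions above describe a string-filtering step and a similarity-scoring step. The following is a minimal sketch of both, not the authors' implementation. Several details are assumptions made only for illustration: the minimum string length of four printable characters, the dotted-number version pattern, mean pooling, the 0.5 decision threshold, and all function names. The four fields of the data-point tuple are not listed in this excerpt, so the sketch returns plain strings, and the toy vectors stand in for real S-BERT token embeddings.

```python
import math
import re

# Assumption: a run of 4 or more printable ASCII bytes counts as a string,
# similar to the default behaviour of the Unix `strings` tool.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Assumption: a version-like string contains a dotted number such as 1.2 or 1.2.3.
VERSION_LIKE = re.compile(r"\b\d+\.\d+(?:\.\d+)*\b")


def extract_strings(blob: bytes) -> list[str]:
    """Return the printable ASCII runs found in a binary blob."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(blob)]


def filter_version_strings(strings: list[str]) -> list[str]:
    """Keep only strings that appear to carry a version number."""
    return [s for s in strings if VERSION_LIKE.search(s)]


def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average per-token embeddings into one fixed-size vector (assumed pooling)."""
    count = len(token_embeddings)
    return [sum(column) / count for column in zip(*token_embeddings)]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; both must be non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_match(u: list[float], v: list[float], threshold: float = 0.5) -> bool:
    """Label a (product, version string) pair as related if similarity is high enough."""
    return cosine_similarity(u, v) >= threshold


if __name__ == "__main__":
    # Hypothetical binary content; real input would be bytes read from an ELF file.
    blob = b"\x7fELF\x00\x01OpenSSL 1.1.1k\x00\x02ab\x00libfoo\x00"
    print(filter_version_strings(extract_strings(blob)))

    # Toy token embeddings standing in for S-BERT output.
    u = mean_pool([[1.0, 0.0], [1.0, 0.0]])
    v = mean_pool([[1.0, 0.0], [0.0, 1.0]])
    print(cosine_similarity(u, v), is_match(u, v))
```

In a real pipeline the token embeddings would come from the S-BERT encoder, and during training the similarity would be compared against the correlation label through a loss function rather than a fixed threshold.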