Chain-of-Experts (CoE): Reverse Engineering Software Bills of Materials for JavaScript Application Bundles through Code Clone Search

Leo Song, Steven H. H. Ding, Yuan Tian, Li Tao Li, Philippe Charland, Andrew Walenstein

TL;DR

This work addresses the challenge of generating a Software Bill of Materials (SBoM) for JavaScript application bundles, where nested scopes, extremely long code sequences, and a vast retrieval space hinder traditional approaches. It introduces Chain-of-Experts (CoE), a multi-task architecture that unifies code segmentation, code classification, and code clone retrieval in a single end-to-end model, using sliding windows, segmentation masking, and Byte-Pair Encoding. The method achieves competitive or superior performance across the three tasks on real-world NPM bundles, with high segmentation accuracy and efficient clone retrieval via embedding-based search. This end-to-end framework enables scalable, provenance-aware SBoM generation for real-world JavaScript releases, improving security and compliance in software supply chains.

Abstract

A Software Bill of Materials (SBoM) is a detailed inventory of all components, libraries, and modules in a software artifact, providing traceability throughout the software supply chain. With the increasing popularity of JavaScript in software engineering due to its dynamic syntax and seamless supply chain integration, the exposure to vulnerabilities and attacks has risen significantly. A JavaScript application bundle is a consolidated, symbol-stripped, and optimized assembly of code for deployment. Generating an SBoM from a JavaScript application bundle through a reverse-engineering process ensures the integrity, security, and compliance of the supplier's software release, even without access to the original dependency graphs. This paper presents the first study on SBoM generation for JavaScript application bundles. We identify three key challenges for this task: nested code scopes, extremely long sequences, and large retrieval spaces. To address these challenges, we introduce Chain-of-Experts (CoE), a multi-task deep learning model designed to generate SBoMs through three tasks: code segmentation, code classification, and code clone retrieval. We evaluate CoE against individual task-specific solutions on 500 web application bundles with over 66,000 dependencies. Our experimental results demonstrate that CoE offers competitive outcomes with less training and inference time than the combined individual task-specific solutions. Consequently, CoE provides the first scalable, efficient, and end-to-end solution for SBoM generation from real-world JavaScript application bundles.

Paper Structure (32 sections, 5 equations, 5 figures, 6 tables, 1 algorithm)

Figures (5)

  • Figure 1: We propose to segment, classify, and then run a clone search on a JavaScript application bundle file to recover its dependency graph through a reverse engineering process.
  • Figure 2: The architectural design and data flow for the CoE model. Expert models are positioned according to the order of tasks. Each window of the inputs is fed into the first backbone model and the classification expert. Class tokens are aggregated based on these predictions and processed through a second backbone model following segmentation. This setup facilitates the subsequent tasks of code classification and code clone retrieval, employing a classification expert model and cosine similarity metrics.
  • Figure 3: The figure illustrates the inputs and outputs of the sliding window operation. The vertical red dotted lines represent segment boundaries, while differently colored blocks indicate different segments. The entire input script is divided into $N$ windows, each of size $win\_size$. The number of input tokens for the CoE model, denoted as $seq\_len$, is equal to $N \times win\_size$. Each window is labeled with two pieces of information: the presence of a segment boundary and the class name. Additionally, each window is paired with a source file and a random contrastive file for cosine similarity matching. Note that the pairs must be padded or truncated to match the $seq\_len$ limit.
  • Figure 4: The figure illustrates segmentation masking, where colored blocks are active attention regions and grey areas are masked, so that tokens can attend to each other only within the same segment.
  • Figure 5: Actual boundaries (vertical dotted lines) vs accumulated predictions (the solid line) by one step sliding window for a bundle file with 35,000 tokens.
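
The sliding window operation described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `make_windows`, the `<pad>` token, the per-token `segment_ids` input, and the choice to label a window with the class of its first token are all hypothetical. It splits a token sequence into $N$ windows of size $win\_size$ (so that $seq\_len = N \times win\_size$ after padding) and marks each window with a boundary flag and a class name.

```python
from typing import Dict, List, Tuple

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def make_windows(
    tokens: List[str],
    segment_ids: List[int],
    class_of: Dict[int, str],
    win_size: int,
) -> List[Tuple[List[str], bool, str]]:
    """Split tokens into fixed-size windows labeled (has_boundary, class_name)."""
    if win_size <= 0:
        raise ValueError("win_size must be positive")
    if len(tokens) != len(segment_ids):
        raise ValueError("tokens and segment_ids must have the same length")

    n_windows = -(-len(tokens) // win_size)  # ceiling division
    seq_len = n_windows * win_size           # seq_len = N * win_size
    padded = tokens + [PAD] * (seq_len - len(tokens))

    windows = []
    for w in range(n_windows):
        start = w * win_size
        end = min(start + win_size, len(tokens))
        # A boundary is present if a new segment starts at any token of this
        # window (the very first token of the script is not a boundary).
        has_boundary = any(
            segment_ids[i] != segment_ids[i - 1]
            for i in range(max(start, 1), end)
        )
        # Assumption: label the window with the class of its first token.
        class_name = class_of[segment_ids[start]]
        windows.append((padded[start:start + win_size], has_boundary, class_name))
    return windows
```

For example, ten tokens spanning three segments with `win_size=4` produce three windows, the last of which is padded with two `<pad>` tokens.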
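
The segmentation masking in the Figure 4 caption amounts to a block-diagonal attention mask. The sketch below is one plausible way to build such a mask from per-token segment ids; the function name `segmentation_mask` and the boolean list-of-lists representation are assumptions made for illustration, and the paper's model may construct or apply the mask differently.

```python
from typing import List


def segmentation_mask(segment_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Attention is allowed only between tokens that share a segment id,
    which yields a block-diagonal pattern for contiguous segments.
    """
    return [[a == b for b in segment_ids] for a in segment_ids]


def render(mask: List[List[bool]]) -> str:
    """Render the mask as text: '#' for active attention, '.' for masked."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)
```

Calling `render(segmentation_mask([0, 0, 1, 1, 1]))` draws a 2x2 block followed by a 3x3 block along the diagonal, mirroring the colored blocks and grey areas in the figure.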
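
The Figure 2 caption mentions cosine similarity for the code clone retrieval step. A minimal sketch of embedding-based retrieval is shown below, assuming each segment and each candidate source file has already been mapped to a fixed-length vector. The names `cosine_similarity` and `retrieve`, the dictionary index, and the toy package names are hypothetical, and a brute-force scan stands in for whatever search structure the paper actually uses.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Rank indexed embeddings by cosine similarity to the query (brute force)."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index with made-up package names and 2-dimensional embeddings.
toy_index = {"pkg-a": [0.9, 0.1], "pkg-b": [0.0, 1.0]}
best_match = retrieve([1.0, 0.0], toy_index, top_k=1)
```

Here `best_match` holds the single candidate whose embedding points most nearly in the same direction as the query segment's embedding.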