WasmWalker: Path-based Code Representations for Improved WebAssembly Program Analysis

Mohammad Robati Shirzad, Patrick Lam

TL;DR

This paper tackles the challenge of Wasm program analysis by introducing WasmWalker, a pipeline that extracts common root-to-leaf AST paths from WebAssembly Text (WAT) and converts them into two representations: a variable-sized path sequence and a fixed-size code embedding (50 dimensions). The authors identify $3{,}352$ such paths from a large Ubuntu-based Wasm corpus after refining over $8 \times 10^5$ raw paths, enabling path-aware inputs for deep learning models. On two evaluation tasks (method-name prediction and precise return-type recovery), WasmWalker-based representations, especially when combined with the last $20$ instructions, consistently outperform the prior state of the art, SnowWhite, with improvements of up to $5.36\%$ Top-1 and $11.31\%$ Top-5 in method-name prediction and $8.02\%$ Top-1 and $7.92\%$ Top-5 in return-type recovery. The work shows that integrating AST-path knowledge yields more informative inputs for Wasm analysis and provides embeddings that cluster semantically related methods, with potential applicability to other intermediate representations beyond Wasm.
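
To make the notion of a root-to-leaf AST path concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes WAT written in the folded S-expression style, treats every atom (including names and immediates) as a leaf, and skips the path refinement the paper describes. The helper names (`tokenize`, `parse`, `root_to_leaf_paths`), the sample function, and the on-the-fly vocabulary are all hypothetical.

```python
from typing import List, Tuple, Union

Node = Union[str, List["Node"]]
Path = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    """Split an S-expression into parentheses and atoms (no strings/comments)."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Node:
    """Parse one S-expression, consuming tokens from the front of the list."""
    token = tokens.pop(0)
    if token != "(":
        return token  # an atom such as `i32` or `$add`
    node: List[Node] = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # drop the closing ")"
    return node


def root_to_leaf_paths(node: Node, prefix: Path = ()) -> List[Path]:
    """Enumerate every path from the root down to each leaf.

    Assumption: each list node starts with its operator name (e.g. `func`).
    """
    if isinstance(node, str):
        return [prefix + (node,)]
    if not node:
        return [prefix]
    path = prefix + (node[0],)
    children = node[1:]
    if not children:
        return [path]
    paths: List[Path] = []
    for child in children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


# Hypothetical sample function in folded WAT syntax.
WAT = """
(func $add (param i32 i32) (result i32)
  (i32.add (local.get 0) (local.get 1)))
"""

tree = parse(tokenize(WAT))
paths = root_to_leaf_paths(tree)

# A variable-sized path sequence: each path replaced by a vocabulary index.
# Here the vocabulary is built on the fly; the paper instead uses a fixed
# set of common paths mined from its corpus.
vocab = {p: i for i, p in enumerate(dict.fromkeys(paths))}
sequence = [vocab[p] for p in paths]

for p in paths:
    print(" / ".join(p))
```

Under these assumptions the sample function yields six paths, two of which (`func / param / i32`) are identical and therefore share one vocabulary index. This repetition hints at why a large corpus can collapse to a comparatively small set of unique paths.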

Abstract

WebAssembly, or Wasm, is a low-level binary language that enables execution of near-native-performance code in web browsers. Wasm has proven to be useful in applications including gaming, audio and video processing, and cloud computing, providing a high-performance, low-overhead alternative to JavaScript in web development. The fast and widespread adoption of WebAssembly by all major browsers has created an opportunity for analysis tools that support this new technology. Deep learning program analysis models can greatly benefit from the program structure information included in Abstract Syntax Tree (AST)-aware code representations. To obtain such code representations, we performed an empirical analysis on the AST paths in the WebAssembly Text format of a large dataset of WebAssembly binary files compiled from source packages in the Ubuntu 18.04 repositories. After refining the collected paths, we discovered that only 3,352 unique paths appeared across all of these binary files. With this insight, we propose two novel code representations for WebAssembly binaries. These novel representations serve not only to generate fixed-size code embeddings but also to supply additional information to sequence-to-sequence models. Ultimately, our approach helps program analysis models uncover new properties from Wasm binaries, expanding our understanding of their potential. We evaluated our new code representation on two applications: (i) method name prediction and (ii) recovering precise return types. Our results demonstrate the superiority of our novel technique over previous methods. More specifically, our new method resulted in 5.36% (11.31%) improvement in Top-1 (Top-5) accuracy in method name prediction and 8.02% (7.92%) improvement in recovering precise return types, compared to the previous state-of-the-art technique, SnowWhite.
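
The Top-1 and Top-5 figures quoted above are most naturally read as Top-k accuracy: a prediction counts as correct when the true label appears among the model's k highest-ranked candidates. The sketch below assumes that standard definition; the function name `top_k_accuracy` and the toy labels are hypothetical and not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of samples whose target is among the first k ranked predictions."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy data: each inner list is one sample's candidates, best guess first.
predictions = [
    ["i32", "i64", "f32"],
    ["f64", "i32", "i64"],
    ["pointer", "i32", "f32"],
]
targets = ["i32", "i32", "f64"]

top1 = top_k_accuracy(predictions, targets, k=1)
top2 = top_k_accuracy(predictions, targets, k=2)
print(f"Top-1: {top1:.3f}, Top-2: {top2:.3f}")
```

Because Top-k only ever adds hits as k grows, Top-5 accuracy is never lower than Top-1 for the same model. The improvements themselves, however, are differences between two models and need not be ordered that way.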

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: This figure illustrates the files generated while compiling C code to Wasm binaries and then extracting paths from WAT files.
  • Figure 2: Overview of the WasmWalker pipeline
  • Figure 3: Cumulative number of paths
  • Figure 4: A t-SNE 2D plot of code embeddings generated using our proposed code embedding approach. The plot shows the spreading of method names in the 2D plane, with similar method names closer to each other, highlighting the effectiveness of our embedding approach in capturing the semantic similarities among Wasm program methods. To prevent label overlapping, we manually adjusted the vertical position of points by up to 3.1%.
  • Figure 5: Model Accuracies for precise return type recovery