OCEAN: Open-World Contrastive Authorship Identification

Felix Mächtle, Jan-Niclas Serr, Nils Loose, Jonas Sander, Thomas Eisenbarth

TL;DR

OCEAN attributes authorship to functions in compiled binaries under extreme open-world conditions, where both samples in a comparison come from unknown authors. It uses contrastive learning with UniXcoder-based embeddings and a diverse set of program representations to distinguish same-author from different-author pairs; the model is trained on the CONAN dataset and evaluated on the unseen SNOOPY dataset. The approach outperforms prior methods in both real-world training/evaluation and open-world binary scenarios, and it can detect new authors in software updates via a dynamically calibrated threshold. By releasing CONAN, SNOOPY, and the supporting tooling, the work provides a robust, transferable framework for strengthening software supply-chain security and defending against code-injection threats.

Abstract

In an era where cyberattacks increasingly target the software supply chain, the ability to accurately attribute code authorship in binary files is critical to improving cybersecurity measures. We propose OCEAN, a contrastive learning-based system for function-level authorship attribution. OCEAN is the first framework to explore code authorship attribution on compiled binaries in an open-world and extreme scenario, where two code samples from unknown authors are compared to determine if they are developed by the same author. To evaluate OCEAN, we introduce new realistic datasets: CONAN, to improve the performance of authorship attribution systems in real-world use cases, and SNOOPY, to increase the robustness of the evaluation of such systems. We use CONAN to train our model and evaluate on SNOOPY, a fully unseen dataset, resulting in an AUROC score of 0.86 even when using high compiler optimizations. We further show that CONAN improves performance by 7% compared to the previously used Google Code Jam dataset. Additionally, OCEAN outperforms previous methods in their settings, achieving a 10% improvement over state-of-the-art SCS-Gan in scenarios analyzing source code. Furthermore, OCEAN can detect code injections from an unknown author in a software update, underscoring its value for securing software supply chains.
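The AUROC quoted in the abstract can be read as the probability that a randomly chosen same-author pair receives a higher similarity score than a randomly chosen different-author pair. The following is a minimal sketch of that reading, not the authors' evaluation code: the function name `pairwise_auroc` and the toy scores are hypothetical, and it assumes the per-pair similarity scores have already been computed.

```python
from typing import Sequence


def pairwise_auroc(same_scores: Sequence[float], diff_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (same, different) score pairs ranked correctly.

    Ties count as half a win, which matches the usual rank-based definition.
    Hypothetical helper for illustration only.
    """
    if not same_scores or not diff_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s in same_scores:
        for d in diff_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(same_scores) * len(diff_scores))


# Toy similarity scores (made up, not taken from the paper).
same = [0.9, 0.8, 0.4]
diff = [0.5, 0.3, 0.2]
score = pairwise_auroc(same, diff)
```

The quadratic loop is only suitable for small samples; a rank-based implementation would be the usual choice for a full evaluation set.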

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Training and inference of OCEAN. All steps are described in the pipeline section.
  • Figure 2: RQ4: Histogram of cosine similarities of an unseen dataset for the same and different authors. The expected minimum for same authors is at $\theta = 0.72$.
  • Figure 3: Performance comparison of samples with Ghidra warnings. Using samples with warnings does not decrease detection performance, but may increase it.
  • Figure 4: CS1: t-SNE visualization of embedding vectors for functions written by two authors. Each dot represents a function, with colors indicating the author.
  • Figure 5: Approximation of the probability distribution of distances for known authors. The malware exceeds this threshold, making it detectable by our method.
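
The thresholding idea behind Figures 2 and 5 can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the captions do not say how the distance distribution is approximated, so the sketch assumes a normal fit to the known authors' distances and a cut-off at mean plus `k` standard deviations. The function names, the value of `k`, and the toy numbers are all hypothetical.

```python
import math
import statistics
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / (norm_a * norm_b)


def calibrate_threshold(known_distances: Sequence[float], k: float = 3.0) -> float:
    """Assumed calibration: normal fit, threshold at mean + k * sample std."""
    mu = statistics.mean(known_distances)
    sigma = statistics.stdev(known_distances)
    return mu + k * sigma


def flag_new_author(update_distances: Sequence[float], threshold: float) -> List[int]:
    """Indices of functions whose distance to known authors exceeds the threshold."""
    return [i for i, d in enumerate(update_distances) if d > threshold]


# Toy distances (made up): functions from known authors, then a software update.
known = [0.10, 0.12, 0.08, 0.11, 0.09]
threshold = calibrate_threshold(known)
suspicious = flag_new_author([0.09, 0.13, 0.55], threshold)
```

In this toy run only the last function in the update lies beyond the calibrated threshold, mirroring the situation in Figure 5 where the injected code exceeds the threshold derived from known authors.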