Towards Code Watermarking with Dual-Channel Transformations

Borui Yang, Wei Li, Liyao Xiang, Bo Li

TL;DR

This paper addresses ownership verification for source code by embedding watermarks into code without altering functionality. It proposes SrcMarker, a dual-channel, AST-based watermarking system that uses MutableAST and a feature-space approximation to enable end-to-end training. Key contributions include a language-agnostic transformation pipeline, an end-to-end trainable embedding/extraction framework, and project-level verification with strong robustness across languages. Empirical results show SrcMarker outperforms natural-language watermark baselines in accuracy, efficiency, and semantic preservation, offering a scalable solution for code provenance.

Abstract

The expansion of the open source community and the rise of large language models have raised ethical and security concerns about the distribution of source code, such as misconduct involving copyrighted code, distribution without proper licenses, or misuse of the code for malicious purposes. Hence it is important to track the ownership of source code, for which watermarking is a major technique. Yet, unlike natural-language watermarking, source code watermarking requires far stricter and more complicated rules to ensure the readability as well as the functionality of the source code. Hence we introduce SrcMarker, a watermarking system that unobtrusively encodes ID bitstrings into source code without affecting the usage and semantics of the code. To this end, SrcMarker performs transformations on an AST-based intermediate representation that enables unified transformations across different programming languages. The core of the system utilizes learning-based embedding and extraction modules to select rule-based transformations for watermarking. In addition, a novel feature-approximation technique is designed to tackle the inherent non-differentiability of rule selection, thus seamlessly integrating the rule-based transformations and learning-based networks into an interconnected system that can be trained end to end. Extensive experiments demonstrate the superiority of SrcMarker over existing methods across various watermarking requirements.
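To make the idea of encoding bits through semantics-preserving transformations concrete, here is a minimal sketch in Python (3.9+ for `ast.unparse`). It is a toy, not SrcMarker's method: the single rule (spelling an infinite loop as `while True:` or `while 1:`) and the function names `embed_bit` and `extract_bit` are assumptions made for illustration. Python's built-in `ast` module stands in for the paper's AST-based representation, and no learned embedding or extraction networks are involved.

```python
import ast
from typing import Optional


def _is_carrier(test: ast.expr) -> bool:
    """True if a loop condition is the constant True or the integer 1."""
    return (
        isinstance(test, ast.Constant)
        and type(test.value) in (bool, int)
        and test.value == 1
    )


def embed_bit(source: str, bit: int) -> str:
    """Encode one bit in how infinite loops are spelled (hypothetical rule).

    bit 0 -> `while True:`, bit 1 -> `while 1:`. Both forms loop forever,
    so the program's behavior is unchanged.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.While) and _is_carrier(node.test):
            # Rewrite the constant in place; the loop body is left untouched.
            node.test.value = 1 if bit else True
    return ast.unparse(tree)


def extract_bit(source: str) -> Optional[int]:
    """Read the bit back from the first carrier loop, or None if absent."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.While) and _is_carrier(node.test):
            return 0 if node.test.value is True else 1
    return None


SAMPLE = (
    "def f(n):\n"
    "    i = 0\n"
    "    while True:\n"
    "        i += 1\n"
    "        if i >= n:\n"
    "            break\n"
    "    return i\n"
)

if __name__ == "__main__":
    marked = embed_bit(SAMPLE, 1)
    print(marked)
    print("extracted bit:", extract_bit(marked))
```

A fixed rule like this carries one bit and is trivial to detect or undo. The abstract describes a system that instead lets learned modules choose among many transformations, which is where the feature approximation for non-differentiable rule selection comes in.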

Paper Structure (26 sections, 7 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Overview of SrcMarker's embedding and extraction.
  • Figure 2: The architecture of SrcMarker. Note that the two "Code Encoder" blocks (in embedding and extraction modules respectively) refer to the same neural network, and the same applies to the two "Watermark Decoder" blocks (in extraction and approximation modules).
  • Figure 3: Overview of the code transformation pipeline in our work. Components in gray belong to MutableAST.
  • Figure 4: Network architecture of the feature approximator with probabilistic masking mechanism.
  • Figure 5: A code snippet watermarked by SrcMarker. Corresponding changes are highlighted in the same color.
  • ...and 5 more figures