Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and Complexity

Mengjiao Zhang, Jia Xu

TL;DR

We propose Subword Embedding from Bytes (SEB), which effectively prevents embedding-based attacks from recovering original sentences in federated learning. SEB obtains comparable or even better results than standard subword embedding methods in machine translation, sentiment analysis, and language modeling, with lower time and space complexity.

Abstract

While NLP models significantly impact our lives, there are rising concerns about privacy invasion. Although federated learning enhances privacy, attackers may still recover private training data by exploiting model parameters and gradients. Protecting against such embedding attacks therefore remains an open challenge. To address this, we propose Subword Embedding from Bytes (SEB), which encodes subwords as byte sequences using deep neural networks, making input text recovery harder. Importantly, our method requires less memory, with a vocabulary of only $256$ bytes, while remaining efficient because the input length is unchanged. Thus, our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. Our experiments show that SEB effectively prevents embedding-based attacks from recovering original sentences in federated learning. We also verify that SEB obtains comparable or even better results than standard subword embedding methods in machine translation, sentiment analysis, and language modeling, with lower time and space complexity.
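To make the idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each subword is mapped to a fixed-length byte code and that the byte embeddings are concatenated and linearly projected to form the subword embedding. The code length `BYTES_PER_SUBWORD`, the width `DIM`, the mapping `make_byte_code`, and the projection `proj` are all hypothetical choices for illustration; only the 256-byte vocabulary and the 50K subword vocabulary (from the Figure 3 caption) come from the text.

```python
import random

BYTE_VOCAB = 256         # byte vocabulary size stated in the abstract
SUBWORD_VOCAB = 50_000   # subword vocabulary size quoted in the Figure 3 caption
DIM = 8                  # hypothetical embedding width
BYTES_PER_SUBWORD = 4    # hypothetical fixed byte-code length per subword

rng = random.Random(0)

def make_byte_code(subword_id):
    """Hypothetical mapping: a fixed pseudo-random byte code per subword id."""
    r = random.Random(subword_id)
    return [r.randrange(BYTE_VOCAB) for _ in range(BYTES_PER_SUBWORD)]

# The only embedding table: one DIM-dimensional vector per byte.
byte_table = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BYTE_VOCAB)]

# Hypothetical aggregation: a linear map from concatenated byte vectors to DIM.
proj = [[rng.gauss(0.0, 0.1) for _ in range(DIM)]
        for _ in range(BYTES_PER_SUBWORD * DIM)]

def subword_embedding(subword_id):
    """Build one subword vector from the embeddings of its byte code."""
    code = make_byte_code(subword_id)
    concat = [x for b in code for x in byte_table[b]]
    return [sum(concat[i] * proj[i][j] for i in range(len(concat)))
            for j in range(DIM)]

def embed_sequence(subword_ids):
    """One vector per subword, so the input length is unchanged."""
    return [subword_embedding(s) for s in subword_ids]

# Parameter counts for the embedding layer under these assumptions.
subword_params = SUBWORD_VOCAB * DIM
seb_params = BYTE_VOCAB * DIM + BYTES_PER_SUBWORD * DIM * DIM

vectors = embed_sequence([17, 4242, 17])
```

In this sketch the model produces one vector per subword, so the sequence length seen by the downstream network is the same as with an ordinary subword table, while the stored embedding parameters depend on the 256-entry byte table rather than on the subword vocabulary size.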

Paper Structure

This paper contains 55 sections, 1 equation, 6 figures, 14 tables, 2 algorithms.

Figures (6)

  • Figure 1: An attack example of recovering text in FL. (a): An FL framework. (b) and (c): Recovering text using embedding gradients of subwords and bytes.
  • Figure 2: (a): An overview of the transformer model with SEB. (b): An example of calculating subword embeddings with byte embedding.
  • Figure 3: The distribution of the number of subwords, unique subwords, and unique bytes in a batch for batch sizes 1, 4, and 16. The vocabulary sizes of subwords and bytes are 50K and 256, respectively.
  • Figure 4: The average coverage of subwords given a random set of bytes with GPT-2 tokenizer.
  • Figure 5: Recovery performance for batch sizes 1, 2, 4, and 8 on WikiText-103.
  • ...and 1 more figures