A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation

Shilong Hou, Ruilin Shang, Zi Long, Xianghua Fu, Yin Chen

TL;DR

The paper tackles privacy risks when users interact with cloud-based LLMs by formalizing a general pseudonymization framework for inference-time protection. It decomposes the approach into three modular components—detection of privacy information, generation of non-sensitive replacement candidates, and replacement during text generation—with a controllable text-generation method to preserve semantic integrity. Experimental results across QA, summarization, inference, and translation show the framework can achieve a favorable privacy-utility balance, approaching large cloud-LLM baselines while outperforming small local models. The work provides practical guidance for secure remote LLM usage and releases code to enable adoption and further research.
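
The three components described above can be illustrated with a minimal local sketch. This is not the authors' implementation: the regex-plus-name-list detector, the `<ENTITY_i>` placeholders, the plain string substitution, and the names `detect`, `generate`, `replace`, `restore`, and `pseudonymized_call` are all assumptions made for illustration. The final restoring step is likewise an assumption about how the local side would map the cloud response back to the original entities.

```python
import re
from typing import Callable, Dict, List

# Hypothetical detector pattern: e-mail addresses only (a stand-in for a real
# privacy-information detector such as an NER model).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def detect(text: str, known_names: List[str]) -> List[str]:
    """Stage 1 (detection): return sensitive spans found in the text."""
    found = EMAIL.findall(text)
    found += [name for name in known_names if name in text]
    return found


def generate(entities: List[str]) -> Dict[str, str]:
    """Stage 2 (generation): map each distinct entity to a placeholder."""
    unique = list(dict.fromkeys(entities))  # de-duplicate, keep order
    return {entity: f"<ENTITY_{i}>" for i, entity in enumerate(unique)}


def replace(text: str, mapping: Dict[str, str]) -> str:
    """Stage 3 (replacement): substitute entities, longest first."""
    for entity in sorted(mapping, key=len, reverse=True):
        text = text.replace(entity, mapping[entity])
    return text


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Assumed final step: put the original entities back into the reply."""
    for entity, placeholder in mapping.items():
        text = text.replace(placeholder, entity)
    return text


def pseudonymized_call(
    prompt: str, known_names: List[str], cloud_llm: Callable[[str], str]
) -> str:
    """Send only the pseudonymized prompt to the (hypothetical) cloud model."""
    mapping = generate(detect(prompt, known_names))
    reply = cloud_llm(replace(prompt, mapping))
    return restore(reply, mapping)
```

Here `cloud_llm` is any callable that takes a prompt string and returns a response string, so the sketch can be exercised with a local stub before wiring in a real client. Plain string substitution is the simplest possible replacement strategy; the paper's controllable text-generation method is meant to preserve semantic integrity in cases where such naive substitution would not.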

Abstract

An increasing number of companies have begun providing services that leverage cloud-based large language models (LLMs), such as ChatGPT. However, this development raises substantial privacy concerns, as users' prompts are transmitted to and processed by the model providers. Among the various privacy protection methods for LLMs, those implemented during the pre-training and fine-tuning phases fail to mitigate the privacy risks associated with the remote use of cloud-based LLMs by users. On the other hand, methods applied during the inference phase are primarily effective in scenarios where the LLM's inference does not rely on privacy-sensitive information. In this paper, we outline the process of remote user interaction with LLMs and, for the first time, propose a detailed definition of a general pseudonymization framework applicable to cloud-based LLMs. The experimental results demonstrate that the proposed framework strikes an optimal balance between privacy protection and utility. The code for our method is available to the public at https://github.com/Mebymeby/Pseudonymization-Framework.

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Potential privacy breach risks in using cloud-based LLM services
  • Figure 2: Overview of pseudonymization framework for cloud-based LLMs
  • Figure 3: Workflow of pseudonymization through controllable text generation
  • Figure 4: Performance metrics and pseudonymization effectiveness of various methods across different datasets