MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction

Sepideh Abedini, Shubhankar Mohapatra, D. B. Emerson, Masoumeh Shafieinejad, Jesse C. Cresswell, Xi He

TL;DR

MaskSQL addresses privacy concerns in text-to-SQL by introducing a policy-guided abstraction framework that preserves essential NL-to-SQL mappings while concealing sensitive schema and data. The three-stage pipeline (Abstraction, SQL Generation, and SQL Reconstruction) uses a local RESDSQL-based ranking to shrink the schema context and a bijective token mapping so that the SQL returned by the remote LLM can be reconstructed exactly. Empirically, MaskSQL outperforms trusted SLM-based baselines on complex queries and achieves competitive accuracy relative to untrusted LLMs, while providing measurable privacy safeguards via configurable policies. The work demonstrates a practical path to deploying high-utility text-to-SQL systems under regulatory privacy constraints and suggests extensions to broader code-generation tasks.
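
The role of the bijective token mapping can be illustrated with a minimal sketch. This is not the authors' implementation: the `SYM_n` placeholder scheme, the function names `build_mapping` and `substitute`, the hand-picked list of sensitive terms, and the hard-coded "LLM response" are all assumptions made for illustration. The point is only that a one-to-one mapping lets the trusted side invert the masking exactly once the abstract SQL comes back.

```python
import re


def build_mapping(sensitive_terms):
    """Give each distinct sensitive term its own placeholder (hypothetical scheme)."""
    unique_terms = list(dict.fromkeys(sensitive_terms))  # dedupe, keep order
    forward = {term: f"SYM_{i}" for i, term in enumerate(unique_terms, start=1)}
    # One placeholder per term and one term per placeholder, so the inverse exists.
    backward = {symbol: term for term, symbol in forward.items()}
    return forward, backward


def substitute(text, table):
    """Replace every whole-token occurrence of a key in `table` with its value."""
    if not table:
        return text
    # Longest keys first, plus lookarounds, so SYM_1 never matches inside SYM_10.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda match: table[match.group(0)], text)


# In a real system a privacy policy would decide which terms are sensitive;
# here the list is written by hand.
forward, backward = build_mapping(["patients", "diagnosis", "Alice"])

question = "List the diagnosis of patients named Alice"
masked_question = substitute(question, forward)  # what the remote LLM would see

# Stand-in for the remote LLM's answer; no model is actually called.
abstract_sql = "SELECT SYM_2 FROM SYM_1 WHERE name = 'SYM_3'"
reconstructed_sql = substitute(abstract_sql, backward)  # done locally

print(masked_question)
print(reconstructed_sql)
```

Because the mapping is invertible, nothing the remote model produces needs to be guessed at during reconstruction: every placeholder in the abstract SQL resolves to exactly one original token.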

Abstract

Large language models (LLMs) have shown promising performance on tasks that require reasoning, such as text-to-SQL, code generation, and debugging. However, regulatory frameworks with strict privacy requirements constrain their integration into sensitive systems. State-of-the-art LLMs are also proprietary, costly, and resource-intensive, making local deployment impractical. Consequently, utilizing such LLMs often requires sharing data with third-party providers, raising privacy concerns and risking noncompliance with regulations. Although fine-tuned small language models (SLMs) can outperform LLMs on certain tasks and be deployed locally to mitigate privacy concerns, they underperform on more complex tasks such as text-to-SQL translation. In this work, we introduce MaskSQL, a text-to-SQL framework that utilizes abstraction as a privacy protection mechanism to mask sensitive information in LLM prompts. Unlike redaction, which removes content entirely, or generalization, which broadens tokens, abstraction retains essential information while discarding unnecessary details, striking an effective privacy-utility balance for the text-to-SQL task. Moreover, by providing mechanisms to control the privacy-utility tradeoff, MaskSQL facilitates adoption across a broader range of use cases. Our experimental results show that MaskSQL outperforms leading SLM-based text-to-SQL models and achieves performance approaching state-of-the-art LLM-based models, while preserving privacy.
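
One way to read the distinction between redaction, generalization, and abstraction is sketched below on a single question. The example question, the `SENSITIVE` table, the `[REDACTED]` marker, and the `VALUE_n` placeholders are hypothetical and are not taken from the paper; the sketch only shows why distinct placeholders can keep a question usable for SQL generation where a shared marker or a coarse category may not.

```python
import re

QUESTION = "How many patients in Toronto were diagnosed with diabetes?"

# Hypothetical sensitive values, each with a coarser category.
SENSITIVE = {"Toronto": "city", "diabetes": "disease"}


def mask(text, chooser):
    """Replace each sensitive term with whatever `chooser(index, category)` returns."""
    for index, (term, category) in enumerate(SENSITIVE.items(), start=1):
        text = re.sub(rf"\b{re.escape(term)}\b", chooser(index, category), text)
    return text


# Redaction: the content is gone and all masked spans look identical.
redacted = mask(QUESTION, lambda index, category: "[REDACTED]")

# Generalization: each token is broadened to a category.
generalized = mask(QUESTION, lambda index, category: f"some {category}")

# Abstraction (as sketched here): each term gets its own opaque symbol, so a
# generated query can still refer to each masked value unambiguously.
abstracted = mask(QUESTION, lambda index, category: f"VALUE_{index}")

print(redacted)
print(generalized)
print(abstracted)
```

In the abstracted form, a model could still emit a predicate that compares one column to `VALUE_1` and another to `VALUE_2`, which the trusted side can later fill in; the redacted and generalized forms do not keep the two values distinguishable in that way.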

Paper Structure

This paper contains 24 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: MaskSQL pipeline. Green dashed boxes delineate text and schema information contained in the "trusted environment", while red boxes denote those exposed to "untrusted third parties".
  • Figure 2: Privacy metrics of MaskSQL compared to ground-truth masking. Higher values indicate stronger privacy preservation.
  • Figure 3: An example of an SQL query that requires advanced constructs like nested queries, which Qwen2.5-7B failed to handle properly. This example is extracted from our experiments.
  • Figure 4: Abstract question, database schema, and SQL query for Example 1.

Theorems & Definitions (1)

  • Example 1