UTOPIA: Unlearnable Tabular Data via Decoupled Shortcut Embedding

Jiaming He, Fuming Luo, Hongwei Li, Wenbo Jiang, Wenshu Fan, Zhenbo Shi, Xudong Jiang, Yi Yu

TL;DR

UTOPIA addresses unlearnable data for tabular domains by exploiting spectral dominance between poison and clean feature subspaces. It combines a theoretical bound showing that a large poison spectrum relative to the clean spectrum forces the model to rely on a global confounder, with a practical, constraint-aware perturbation strategy that decouples features into a dominant and a recessive channel via Influence-Guided Subspace Decoupling and differential gradient steering. The core contribution is a bi-level, geometry-aware UE that delivers a dominant shortcut while preserving tabular validity, achieving near-random performance for unauthorized training and transferring across architectures. The work has practical implications for data privacy in finance and healthcare, while highlighting the need for defenses and governance to prevent misuse of such protection mechanisms.

Abstract

Unlearnable examples (UE) have emerged as a practical mechanism to prevent unauthorized model training on private vision data, but extending this protection to tabular data is nontrivial. Tabular data in finance and healthcare is highly sensitive, yet existing UE methods transfer poorly because tabular features mix numerical and categorical constraints and exhibit saliency sparsity, with learning dominated by a few dimensions. Under a Spectral Dominance condition, we show that certified unlearnability is feasible when the poison spectrum overwhelms the clean semantic spectrum. Guided by this, we propose Unlearnable Tabular Data via DecOuPled Shortcut EmbeddIng (UTOPIA), which exploits feature redundancy to decouple optimization into two channels: high-saliency features for semantic obfuscation and low-saliency redundant features for embedding a hyper-correlated shortcut, yielding constraint-aware dominant shortcuts while preserving tabular validity. Extensive experiments across tabular datasets and models show that UTOPIA drives unauthorized training toward near-random performance, outperforming strong UE baselines and transferring well across architectures.
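
The two-channel split described in the abstract can be illustrated with a minimal sketch. It is not the authors' Influence-Guided Subspace Decoupling procedure, whose details are not given here. As an assumption, it uses the mean absolute input gradient of a logistic loss as a stand-in saliency score, and the names `gradient_saliency` and `split_channels`, the surrogate weights `w`, and the cutoff `k` are all hypothetical.

```python
import numpy as np

def gradient_saliency(X, y, w, b=0.0):
    """Mean absolute input gradient of the logistic loss, per feature.

    Assumption: a linear surrogate model stands in for whatever model
    the actual method uses to score feature influence.
    """
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    # For logistic loss, d(loss_i)/d(x_ij) = (p_i - y_i) * w_j.
    grads = (p - y)[:, None] * w[None, :]
    return np.abs(grads).mean(axis=0)

def split_channels(saliency, k):
    """Split feature indices into the top-k salient ones and the rest."""
    order = np.argsort(saliency)[::-1]
    return np.sort(order[:k]), np.sort(order[k:])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Hypothetical surrogate weights: two dominant features, four near-redundant ones.
w = np.array([3.0, -2.0, 0.1, 0.05, 0.0, 0.02])
y = (X @ w > 0).astype(float)

sal = gradient_saliency(X, y, w)
high_idx, low_idx = split_channels(sal, k=2)
print("high-saliency channel:", high_idx)
print("low-saliency channel:", low_idx)
```

In this toy setting the high-saliency channel would be the candidate for semantic obfuscation and the low-saliency channel the candidate for carrying the shortcut. How each channel is then perturbed under tabular constraints is outside the scope of the sketch.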

Paper Structure

This paper contains 19 sections, 2 theorems, 20 equations, 8 figures, 7 tables, 1 algorithm.

Key Result

Theorem 4.2

Under the orthogonality assumption, the learning dynamics decouple along the clean and poison subspaces. Let $\mathbf{w}^* = \mathbf{w}_c^* + \mathbf{w}_p^*$ denote the equilibrium solution of the gradient dynamics. Then there exists a constant $\xi > 0$ (encoding label alignment) and a stric[...] where $\kappa = \lambda_p / \lambda_c$ is the spectral imbalance ratio. For any fixed [...]
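
The theorem statement above is truncated, so the following is only a generic illustration of decoupled dynamics, not a reproduction of the result. It assumes a least-squares objective with two orthogonal directions whose curvatures are hypothetical values `lam_c` and `lam_p`, and shows that plain gradient descent fits the direction with the larger eigenvalue faster when $\kappa = \lambda_p / \lambda_c$ is large.

```python
import numpy as np

# Hypothetical eigenvalues for the clean and poison directions.
lam_c, lam_p = 0.5, 5.0
kappa = lam_p / lam_c  # spectral imbalance ratio

eigs = np.array([lam_c, lam_p])
target = np.array([1.0, 1.0])  # hypothetical per-direction optimum

def gd_trajectory(steps, lr=0.1):
    """Gradient descent on 0.5 * sum_i eigs[i] * (w[i] - target[i])**2."""
    w = np.zeros(2)
    history = [w.copy()]
    for _ in range(steps):
        # The gradient is diagonal, so the two coordinates never interact.
        grad = eigs * (w - target)
        w = w - lr * grad
        history.append(w.copy())
    return np.array(history)

traj = gd_trajectory(steps=10)
clean_progress, poison_progress = traj[-1]
print(f"kappa = {kappa}")
print(f"clean direction fitted:  {clean_progress:.4f}")
print(f"poison direction fitted: {poison_progress:.4f}")
```

Each coordinate follows $w_i(t) = 1 - (1 - \eta \lambda_i)^t$ with step size $\eta$, so the gap between the two directions widens as $\kappa$ grows. Whether and how this translates into the suppressed clean weight $\mathbf{w}_c^*$ at equilibrium depends on the parts of the theorem not shown here.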

Figures (8)

  • Figure 1: Illustration of the UTD that prevents unauthorized training on tabular data by injecting constraint-aware perturbations, transforming clean records into an unlearnable dataset that leads to poor generalization and performance on clean test data.
  • Figure 2: Comparison of convergence flatness, i.e., the $\mathcal{L}_{\theta}$ landscape.
  • Figure 3: Comparison of noise robustness.
  • Figure 4: Hyperparameter sensitivity analysis on the JV dataset. The orange line denotes the average performance across all models.
  • Figure 5: Feature gradient saliency analysis and accuracy when removing the top-K and bottom-K features on the California Housing dataset using FTT.
  • ...and 3 more figures

Theorems & Definitions (5)

  • Theorem 4.2: Analytical Clean Weight Suppression
  • Theorem 4.3: Certified Accuracy Bound via Lambert Dynamics
  • Remark 4.4
  • proof
  • proof