
Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions

Atsuki Sato, Martin Aumüller, Yusuke Matsui

TL;DR

This work proves a characterization of the optimal single-point poisoning attack, shows that the existing method yields this optimal attack, and rigorously derives the key properties that an optimal multi-point attack should satisfy.

Abstract

Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models (Kraska et al., SIGMOD'18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD'22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component in many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack should satisfy. (iii) We propose a method to compute an upper bound of the multi-point poisoning attack's impact and empirically demonstrate that the loss under the greedy approach is often close to this bound. Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes.
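
To make the setting concrete, the following is a minimal sketch of linear regression over a CDF, assuming that each key is regressed onto its 1-based rank in sorted order and that the loss is the mean squared error (MSE) of that fit. The function names `fit_cdf_line` and `mse` are illustrative and do not come from the paper.

```python
def fit_cdf_line(keys):
    """Least-squares fit rank ~ w * key + b (assumption: rank is the 1-based sorted position)."""
    xs = sorted(keys)
    n = len(xs)
    ranks = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_r = sum(ranks) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    cov_xr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, ranks)) / n
    w = cov_xr / var_x
    b = mean_r - w * mean_x
    return w, b


def mse(keys):
    """Mean squared error of the fitted line against the ranks."""
    xs = sorted(keys)
    w, b = fit_cdf_line(xs)
    return sum((w * x + b - r) ** 2 for r, x in enumerate(xs, start=1)) / len(xs)


if __name__ == "__main__":
    keys = [2, 11, 13, 19, 32, 36, 39]
    print("clean MSE:   ", mse(keys))
    # Inserting a poison key shifts the rank of every larger key by one,
    # so the model must be refitted on the poisoned key set.
    print("poisoned MSE:", mse(keys + [12]))
```

The poison key `12` here is only an example input; which key maximizes the loss is exactly the question the paper analyzes.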

Paper Structure (36 sections, 11 theorems, 36 equations, 18 figures, 1 table, 7 algorithms)

This paper contains 36 sections, 11 theorems, 36 equations, 18 figures, 1 table, 7 algorithms.

Key Result

Theorem 1

Observation \ref{observ:single_point_poisoning_attack} always holds. In other words, the optimal single-point attack $\mathcal{P}^\ast$ satisfies:

Figures (18)

  • Figure 1: Single-point poisoning attack on $\mathcal{K} = \{2, 11, 13, 19, 32, 36, 39\}$. By inserting a poison point, the rank of every key greater than the poison increases by one. The linear regression model is fitted on the poisoned key set, leading to a change in MSE.
  • Figure 2: Proof of \ref{thm:single_point_poisoning_attack}. \ref{lem:single_point_poisoning_attack} implies that either $E(\mathcal{K} \cup \{p^\ast - 1\})$ or $E(\mathcal{K} \cup \{p^\ast + 1\})$ is greater than $E(\mathcal{K} \cup \{p^\ast\})$.
  • Figure 3: Multi-point poisoning attack. The greedy two-point attack (Kornaropoulos et al., SIGMOD'22) injects the poison point $12$ first (\ref{fig:greedy_poisoning_p1}), followed by the injection of the poison point $10$ (\ref{fig:greedy_poisoning_p2}). In contrast, the optimal attack is $\{37,38\}$, resulting in a higher MSE (\ref{fig:optimal_poisoning_p2}).
  • Figure 4: Proof of \ref{thm:structure_of_optimal_multi_point_attack}. \ref{lem:multi_point_poisoning_attack} implies that either $E(\mathcal{K} \cup \mathcal{P}_{-})$ or $E(\mathcal{K} \cup \mathcal{P}_{+})$ is greater than $E(\mathcal{K} \cup \mathcal{P}^\ast)$.
  • Figure 5: Examples of the Segment + Endpoint (Seg+E) attack in original and relaxed settings.
  • ...and 13 more figures
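
The attacks described in the captions above can be explored with a small brute-force sketch. It assumes integer keys, poison candidates restricted to unused integers strictly between the smallest and largest key, and the MSE of a rank-on-key least-squares fit as the loss $E$. The helper names (`candidates`, `greedy_attack`, `exhaustive_attack`) are hypothetical, and the exhaustive search is only feasible for tiny key sets; it is not the method proposed in the paper.

```python
from itertools import combinations


def mse(keys):
    """Residual MSE of the least-squares line rank ~ key: Var(R) - Cov(X, R)^2 / Var(X)."""
    xs = sorted(keys)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = (n + 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_r = sum((r - mean_r) ** 2 for r in range(1, n + 1)) / n
    cov = sum((x - mean_x) * (r - mean_r) for r, x in enumerate(xs, start=1)) / n
    return var_r - cov * cov / var_x


def candidates(keys):
    """Assumed poison domain: unused integers strictly inside the key range."""
    used = set(keys)
    return [p for p in range(min(used) + 1, max(used)) if p not in used]


def greedy_attack(keys, budget):
    """Repeatedly insert the single poison key that maximizes the current MSE."""
    poisoned, chosen = list(keys), []
    for _ in range(budget):
        best = max(candidates(poisoned), key=lambda p: mse(poisoned + [p]))
        chosen.append(best)
        poisoned.append(best)
    return chosen


def exhaustive_attack(keys, budget):
    """Try every poison set of the given size (exponential; small inputs only)."""
    base = list(keys)
    return max(combinations(candidates(base), budget),
               key=lambda poison: mse(base + list(poison)))


if __name__ == "__main__":
    K = [2, 11, 13, 19, 32, 36, 39]
    greedy = greedy_attack(K, 2)
    optimal = exhaustive_attack(K, 2)
    print("greedy: ", greedy, mse(K + greedy))
    print("optimal:", list(optimal), mse(K + list(optimal)))
```

Comparing the two printed losses shows whether the greedy choice is suboptimal for a given key set. Whether the output reproduces the poison points quoted in the Figure 3 caption depends on the key set and poison domain assumed here.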

Theorems & Definitions (18)

  • Definition 1: Linear Regression on CDFs (Kornaropoulos et al., SIGMOD'22)
  • Definition 2: Poisoning Linear Regression on CDFs (Kornaropoulos et al., SIGMOD'22)
  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Definition 3: Relaxed Poisoning Problem
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • ...and 8 more