Metricizing the Euclidean Space towards Desired Distance Relations in Point Clouds

Stefan Rass; Sandra König; Shahzad Ahmad; Maksim Goman

TL;DR

This work asks whether a Euclidean point cloud can be endowed with a metric that fixes arbitrary pairwise distances, thereby steering clustering results. It proceeds in two steps: first, for a point set $Y$ of size $m=|Y|=O(\sqrt{\ell})$, it constructs a norm on $\mathbb{R}^\ell$ that realizes the prescribed distances up to a common scale; second, it removes the bound on $m$ by embedding into $\mathbb{R}^h$ with $h=\binom{m}{2}$ and building a norm $\|\cdot\|_Q$ that enforces the desired proximity relations in high dimension. The paper further introduces an $\varepsilon$-semimetric $\tilde{d}$ that realizes the target distances directly in the original space, with the triangle inequality holding up to an additive error. Experimental demonstrations show that standard clustering algorithms such as $k$-Means and DBSCAN can be steered to produce pre-chosen outcomes by supplying tailored distance measures, which highlights security risks in clustering pipelines and the need for transparent, verifiable algorithm configurations. Overall, the work reveals fundamental vulnerabilities in proximity-based clustering and gives constructive methods to manipulate clustering via metric design, urging rigorous safeguards in AI systems.
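For orientation, the relaxed triangle inequality mentioned above can be written out. The display below is only a sketch: it assumes that the additive error is exactly the parameter $\varepsilon$, whereas the paper's Definition 1 is not reproduced on this page and may differ in detail.

$$\tilde{d}(\mathbf x,\mathbf z)\leq \tilde{d}(\mathbf x,\mathbf y)+\tilde{d}(\mathbf y,\mathbf z)+\varepsilon \quad\text{for all } \mathbf x,\mathbf y,\mathbf z\in\mathbb{R}^\ell.$$

Under this reading, setting $\varepsilon=0$ recovers the ordinary triangle inequality.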

Abstract

Given a set of points in the Euclidean space $\mathbb{R}^\ell$ with $\ell>1$, the pairwise distances between the points are determined by their spatial location and the metric $d$ that we endow $\mathbb{R}^\ell$ with. Hence, the distance $d(\mathbf x,\mathbf y)=\delta$ between two points is fixed by the choice of $\mathbf x$, $\mathbf y$, and $d$. We study the related problem of fixing the value $\delta$ and the points $\mathbf x,\mathbf y$, and ask whether there is a topological metric $d$ that computes the desired distance $\delta$. We demonstrate that this problem is solvable by constructing a metric that simultaneously gives desired pairwise distances between up to $O(\sqrt\ell)$ many points in $\mathbb{R}^\ell$. We then introduce the notion of an $\varepsilon$-semimetric $\tilde{d}$ to formulate our main result: for all $\varepsilon>0$, for all $m\geq 1$, for any choice of $m$ points $\mathbf y_1,\ldots,\mathbf y_m\in\mathbb{R}^\ell$, and all chosen sets of values $\{\delta_{ij}\geq 0: 1\leq i<j\leq m\}$, there exists an $\varepsilon$-semimetric $\tilde{d}:\mathbb{R}^\ell\times \mathbb{R}^\ell\to\mathbb{R}$ such that $\tilde{d}(\mathbf y_i,\mathbf y_j)=\delta_{ij}$, i.e., the desired distances are accomplished, irrespective of the topology that the Euclidean or other norms would induce. We showcase our results by using them to attack unsupervised learning algorithms, specifically $k$-Means and density-based (DBSCAN) clustering algorithms. These algorithms have manifold applications in artificial intelligence, and running them with externally provided distance measures constructed as shown here can make them produce pre-determined, and hence malleable, results. This demonstrates that the results of clustering algorithms may not generally be trustworthy unless there is a standardized and fixed prescription to use a specific distance function.
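The effect described in the abstract can be illustrated with a toy example. The sketch below is not the paper's norm or $\varepsilon$-semimetric construction: it uses a hypothetical lookup table of prescribed distances $\delta_{ij}$ that is defined only on the four sample points, and a simple threshold-linkage clustering (hypothetical helper `threshold_clusters`) rather than $k$-Means or DBSCAN. It only shows that the same points group differently once the distance function is supplied from outside.

```python
import math
from itertools import combinations


def threshold_clusters(points, dist, eps):
    """Group point indices into connected components of the graph that
    links any two points whose distance is at most eps."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two visually obvious pairs under the Euclidean metric.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]


def euclidean(x, y):
    return math.dist(x, y)


# Hypothetical attacker-chosen distances delta_ij (assumption, not from the paper).
# They satisfy the triangle inequality on these four points.
prescribed = {
    (0, 2): 0.1, (1, 3): 0.1,
    (0, 1): 9.0, (0, 3): 9.0, (1, 2): 9.0, (2, 3): 9.0,
}


def tailored(x, y):
    """Lookup-table distance, defined only on the sample points."""
    i, j = points.index(x), points.index(y)
    if i == j:
        return 0.0
    return prescribed[(min(i, j), max(i, j))]


honest = threshold_clusters(points, euclidean, eps=1.0)
steered = threshold_clusters(points, tailored, eps=1.0)
print(honest)   # [[0, 1], [2, 3]]
print(steered)  # [[0, 2], [1, 3]]
```

The lookup table is the simplest possible stand-in. The constructions summarized above go further by extending such prescribed distances to a norm or $\varepsilon$-semimetric on the whole space.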

Paper Structure (18 sections, 4 theorems, 11 equations, 4 figures, 4 tables)

Introduction
Showcase: Security of Clustering Algorithms
Our Contribution
Preliminaries
The Problem
Results
Embedding Points at desired Distances
Dropping the constraint on $m$
$\varepsilon$-Semimetrics to Manipulate Distances
Experimental Demonstration
Attacking Clustering: General Outline
Manipulating $k$-Means
Results:
Manipulating DBSCAN
Results:
...and 3 more sections

Key Result

Lemma 1

Let $\mathbf x_1,\ldots,\mathbf x_n\in \mathbb{R}^n$ be independently, but not necessarily identically, sampled from distributions that are all absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}^n$. Put the vectors as columns into an $(n\times n)$-matrix $\mathbf M$. Then, $\mathbf M$ h…
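The lemma's setup can be explored numerically. The conclusion of the lemma is cut off above and is not reconstructed here; the sketch below only illustrates the well-known fact that a square matrix with independently sampled, absolutely continuous columns is invertible with probability 1. Gaussian sampling with a different mean and scale per column is an assumption made for the example, and `determinant` is a hypothetical helper, not something taken from the paper.

```python
import random


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5

# Column k is drawn from its own Gaussian (mean k, standard deviation 1 + k):
# independent, but not identically distributed.
columns = [[rng.gauss(k, 1.0 + k) for _ in range(n)] for k in range(n)]

# Arrange the sampled vectors as the columns of M.
M = [[columns[k][i] for k in range(n)] for i in range(n)]

det_M = determinant(M)
print(det_M != 0.0)
```

A non-zero determinant means the sampled columns are linearly independent, which is the kind of independence that the caption of Figure 2 refers to.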

Figures (4)

Figure 1: Different neighborhoods depending on the underlying metric
Figure 2: Linear independence by (stochastically) independent neighbor choices
Figure 3: Distance of $\mathbf y_i,\mathbf y_j$ based on the separation of $\mathbf z_i$ and $\mathbf z_j$
Figure 4: Violation of the triangle inequality: the triple $\mathbf y_i,\mathbf y_k,\mathbf y_j$ would satisfy the triangle inequality on the distances between them. However, the distances between $(\mathbf z_{i,2},\mathbf z_{k,1})$ and between $(\mathbf z_{k,2},\mathbf z_{j,2})$ add up to a value less than the direct distance from $\mathbf z_{i,1}$ to $\mathbf z_{j,1}$. Hence, the triangle inequality cannot generally hold for distances between $\mathbf y_i,\mathbf y_j,\mathbf y_k$ evaluated on the $\mathbf z$-neighbors as done by the $\varepsilon$-semimetric $\tilde{d}$, since the neighbors are determined under other constraints than this inequality.

Theorems & Definitions (12)

Definition 1: Metric, ($\varepsilon$-)Semimetric, and Generalizations
Lemma 1
proof
Theorem 1
proof
Theorem 2
proof: Proof of Theorem 2
Claim 1
proof: Proof of Claim 1
Theorem 3
...and 2 more
