Density based Spatial Clustering of Lines via Probabilistic Generation of Neighbourhood

Akanksha Das; Malay Bhattacharyya

TL;DR

The paper addresses clustering lines and line segments in high-dimensional spaces where a traditional distance metric is ill-suited. It introduces DeLi, a density-based, DBSCAN-inspired framework that generates probabilistic $f_l$-neighbourhoods along each line using a scaling factor $\alpha_l=V/V(N_{f_l,l})$ to form density-based clusters without prespecifying the number of clusters. A key contribution is the $f_l$-neighbourhood concept and the $l$-cardinality criterion, enabling outlier detection and effective handling of incomplete data by embedding domain knowledge through $f_l$. Empirical results on synthetic and real-world line datasets, as well as point datasets with missing entries (e.g., Sporulation), demonstrate robustness to noise and the ability to recover meaningful clusters, with domain knowledge improving stability in missing-value scenarios.
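The DBSCAN-style expansion that the summary alludes to can be sketched generically. The code below is not the paper's DeLi algorithm: it does not construct $f_l$-neighbourhoods or evaluate the $l$-cardinality, whose definitions are not reproduced here. It assumes the caller has already decided which lines are neighbours (the `neighbours` argument is a hypothetical precomputed relation) and treats `c` as a plain minimum-neighbour count, showing only how clusters and noise emerge without fixing the number of clusters in advance.

```python
from collections import deque
from typing import List, Optional, Set

NOISE = -1  # label for items that end up in no cluster


def density_cluster(neighbours: List[Set[int]], c: int) -> List[int]:
    """Generic DBSCAN-style labelling over a precomputed neighbour relation.

    neighbours[i] is the set of item indices regarded as neighbours of item i
    (an assumption: how this relation is built for lines is left to the caller).
    An item with at least c neighbours is treated as a core item.
    """
    labels: List[Optional[int]] = [None] * len(neighbours)
    cluster = 0
    for i in range(len(neighbours)):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < c:
            labels[i] = NOISE  # may later be relabelled as a border item
            continue
        labels[i] = cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border item reached from a core item
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= c:
                queue.extend(neighbours[j])  # only core items keep expanding
        cluster += 1
    return [NOISE if lab is None else lab for lab in labels]


# Items 0-2 are mutually neighbouring; items 3-5 are too sparse to form a cluster.
example = [{1, 2}, {0, 2}, {0, 1}, {4}, {3}, set()]
labels = density_cluster(example, c=2)
```

In this toy input the first three items form one cluster and the rest are marked as noise. Whether an item counts as its own neighbour is a choice left to whoever builds the relation.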

Abstract

Density based spatial clustering of points in $\mathbb{R}^n$ has a myriad of applications in a variety of industries. We generalise this problem to the density based clustering of lines in high-dimensional spaces, keeping in mind that no valid distance measure for lines satisfies the triangle inequality. In this paper, we design a clustering algorithm that generates a customised neighbourhood of fixed volume (given as a parameter) for each line, shaped by an optional parameter given as a continuous probability density function. This algorithm is not sensitive to outliers and can effectively identify the noise in the data using a cardinality parameter. One of the pivotal applications of this algorithm is clustering data points in $\mathbb{R}^n$ with missing entries, while utilising the domain knowledge of the respective data. In particular, the proposed algorithm is able to cluster $n$-dimensional data points that contain at least $(n-1)$-dimensional information. We illustrate the neighbourhoods for the standard probability distributions with continuous probability density functions and demonstrate the effectiveness of our algorithm on various synthetic and real-world datasets (e.g., rail and road networks). The experimental results also highlight its application in clustering incomplete data.
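The link between incomplete points and lines can be made concrete: a point in $\mathbb{R}^n$ with exactly one missing coordinate is consistent with every position along a line parallel to the missing axis. The sketch below illustrates only that geometric observation. The function names, the use of `None` for a missing entry, and the anchor-plus-direction representation are assumptions made for illustration, not the paper's notation.

```python
from typing import List, Optional, Tuple

Line = Tuple[List[float], List[float]]  # (anchor point, unit direction)


def point_to_line(point: List[Optional[float]]) -> Line:
    """Map a point with exactly one missing coordinate (None) to an axis-parallel line."""
    missing = [i for i, value in enumerate(point) if value is None]
    if len(missing) != 1:
        raise ValueError("expected exactly one missing coordinate")
    k = missing[0]
    # The anchor fixes the known coordinates; the missing one is set to 0 by convention.
    anchor = [0.0 if value is None else float(value) for value in point]
    # The direction is the unit vector along the missing axis.
    direction = [1.0 if i == k else 0.0 for i in range(len(point))]
    return anchor, direction


def candidate_at(line: Line, t: float) -> List[float]:
    """Return the completed point obtained by filling the missing coordinate with t."""
    anchor, direction = line
    return [a + t * d for a, d in zip(anchor, direction)]


# A 3-dimensional observation whose second coordinate was not recorded.
line = point_to_line([2.0, None, 5.0])
completed = candidate_at(line, 3.5)
```

Here `line` is the anchor `[2.0, 0.0, 5.0]` with direction `[0.0, 1.0, 0.0]`, and `completed` is `[2.0, 3.5, 5.0]`. Domain knowledge about plausible values of the missing coordinate would then correspond to a density along this line, which is the role the abstract assigns to the probability density function.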

Paper Structure (16 sections, 2 theorems, 1 equation, 1 figure, 4 tables, 1 algorithm)


Introduction
Motivation
Related Work
Proposed Problem
Basic Notations
Problem Formulation
Proposed Method
Theoretical Insights
Algorithm
Complexity Analysis
Results
Dataset Details
Empirical Analysis
Results on Line Datasets
Results on Point Datasets with Missing Entries
...and 1 more section

Key Result

Theorem 1

$P_l$ exists and is unique with respect to the line or line segment $l$.

Figures (1)

Figure 1: Clustering results for (a) Convex dataset considering $\alpha_l = 12$ and $c = 10$, (b) Doughnut dataset considering $\alpha_l = 12$ and $c = 5$, (c) Doughnut dataset considering $\alpha_l = 12$ and $c = 8$, (d) Sparse Tripod dataset considering $\alpha_l = 0.008$ and $c = 3$, (e) Dense Tripod dataset considering $\alpha_l = 0.005$ and $c = 3$, (f) Broken Beads 1 dataset considering $\alpha_l = 0.005$ and $c = 3$, (g) Broken Beads 2 dataset considering $\alpha_l = 0.005$ and $c = 8$, (h) Spectral Band 1 dataset considering $\alpha_l = 0.008$ and $c = 5$, (i) Spectral Band 2 dataset considering $\alpha_l = 0.005$ and $c = 3$.

Theorems & Definitions (7)

Theorem 1
Definition 1: $f$-neighbourhood of a line
Theorem 2
Definition 2: Scaling factor of $f$-neighbourhood of a line
Definition 3: Neighbourhood of a line
Definition 4: Neighbourhood Relation
Definition 5: $l$-cardinality
