A shortest-path based clustering algorithm for joint human-machine analysis of complex datasets

Diego Ulisse Pizzagalli, Santiago Fernandez Gonzalez, Rolf Krause

TL;DR

This work proposes an algorithm that achieves clustering by exploring the paths between points and supports the integration of existing knowledge about admissible and non-admissible clusters by training a path classifier.

Abstract

Clustering is a technique for analyzing datasets obtained from empirical studies in several disciplines, with major applications in biomedical research. Essentially, clustering algorithms are executed by machines with the aim of finding groups of related points in a dataset. However, the result of the grouping depends on both the metric used for point-to-point similarity and the rules used for point-to-group association. Inappropriate metrics and rules can lead to undesirable clustering artifacts. This is especially relevant for datasets in which groups with heterogeneous structures co-exist. In this work, we propose an algorithm that achieves clustering by exploring the paths between points. This allows both evaluating the properties of a path (such as gaps and density variations) and expressing a preference for certain paths. Moreover, our algorithm supports the integration of existing knowledge about admissible and non-admissible clusters by training a path classifier. We demonstrate the accuracy of the proposed method on challenging datasets, including points from synthetic shapes in publicly available benchmarks and microscopy data.
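The path-exploration idea can be illustrated with the minimax ($L_\infty$) path cost that the Figure 2 caption describes as avoiding gaps. The following is a minimal sketch, not the authors' implementation. It assumes a complete graph with Euclidean edge weights and hand-picked seed nodes standing in for density peaks; the function name `minimax_cluster` and the toy points are hypothetical. Each point is assigned to the seed that reaches it along the path whose longest edge is smallest.

```python
import heapq
import math


def minimax_cluster(points, seeds):
    """Label each point by the seed that reaches it with the smallest bottleneck edge.

    points: list of (x, y) tuples.
    seeds:  list of indices into points (assumed stand-ins for density peaks).
    Assumption: complete graph, Euclidean edge weights.
    """
    n = len(points)
    best = [math.inf] * n   # best known minimax cost per node
    label = [-1] * n        # seed index that currently owns each node
    heap = []
    for s in seeds:
        best[s] = 0.0
        label[s] = s
        heapq.heappush(heap, (0.0, s))

    while heap:
        cost, i = heapq.heappop(heap)
        if cost > best[i]:
            continue  # stale heap entry, a better path was found already
        for k in range(n):
            if k == i:
                continue
            # Minimax cost of extending the path to k: the longest edge so far.
            d_curr = max(cost, math.dist(points[i], points[k]))
            # Strict "<" keeps the first owner on ties (a choice of this sketch).
            if d_curr < best[k]:
                best[k] = d_curr
                label[k] = label[i]
                heapq.heappush(heap, (d_curr, k))
    return label, best


# A chain of unit steps from x=0 to x=5, and a second group starting at x=7.5.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0),
       (7.5, 0.0), (8.5, 0.0)]
labels, costs = minimax_cluster(pts, seeds=[0, 6])
print(labels)  # -> [0, 0, 0, 0, 0, 0, 6, 6]
```

In this toy example the point at x=5 is closer in straight-line distance to the seed at x=7.5 (2.5 versus 5), yet it is assigned to seed 0, because the chain reaches it without any edge longer than 1. A trained path classifier, as proposed in the paper, would replace the `max(...)` step with a learned cost over a local window of the path.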

Paper Structure

This paper contains 7 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: Association rules on a simplified example. A. Raw data-points in a 2d space. B. Color-coded point density (red corresponds to low density, yellow corresponds to high density). C. Local association rule (each point connected to the closest point with a higher density). D. Proposed global association rule using minimax cost function.
  • Figure 2: Different non-decreasing path costs. A. Cumulative Euclidean cost, used in the standard Dijkstra algorithm. B. $L_\infty$ cost (minimax). This choice avoids gaps, and the maximum path cost is bounded by the value of the longest edge in the path. C. Cumulative non-Euclidean cost. Here $f(\cdot)$ is the cost of a path defined on a local window $w$; $f(\cdot)$ can be derived from a trained classifier to express path preferences. D. The non-Euclidean path cost $f(\cdot)$ with a minimax formulation. E. Example of the execution of the algorithm, using a trained model as the path-cost function. A path $\Gamma(i)$ has been identified from the starting node $S$ to the node $i$. The cost of extending this path to the node $k$ is evaluated as a function of the latest segment of the path and the proposed edge, $f(\Gamma_w, k)$. If the proposed path cost $D_{curr}$ is less than or equal to the best distance ($D^*$), then the edge is included in the current path and the node $k$ is associated with the current cluster.
  • Figure 3: Qualitative results (color-coded labels) produced by different density-based clustering algorithms on the shapes of immune cells. Data were acquired by confocal microscopy and include murine CD11c+ GFP immune cells in normal conditions. The 2D maximum intensity projection (MIP) is shown. A. CDP [rodriguez2014clustering] using a Euclidean metric correctly separates touching cells but associates a piece of dendrite with the wrong cell. B. DBSCAN correctly reconstructs the shape of dendritic cells but is not able to separate touching cells with the same density-reachability criterion. C. The proposed method correctly associates the dendrite with the dendritic cell and separates touching cells. Black lines indicate the optimal path followed by the algorithm from the cell centroid (density peak) to a point in the dendrite and to a point in the touching region, respectively.
  • Figure 4: Quantitative results produced by different density-based clustering algorithms: CDP [rodriguez2014clustering], DBSCAN [DBSCAN], the proposed method using a minimax path-cost function, and the proposed method using a trained model (Support Vector Machine). The F1 score (A) is computed against the ground truth as $F1 = 2 \cdot \frac{Recall \cdot Precision}{Recall + Precision}$, and the Jaccard index (B) is computed against the ground truth as $\frac{TP}{TP+FN+FP}$, where TP are the True Positives, FN the False Negatives, FP the False Positives and TN the True Negatives. For the dataset 01_Chang, a predictive model was trained on 50 desired paths and on 50 undesired paths (D), which were randomly generated by the script in Supplementary script 2. These paths were defined over a local window of 5 nodes, using the density profile (i.e. an ordered vector of densities) as the feature describing the path (C). Using this constraint, the proposed method achieved an F1 score $\geq$ 0.99 and a Jaccard index $\geq$ 0.98 on the difficult Chang_01 example [wiwie2015comparing].
  • Figure 5: Tracking as a clustering problem. A. Quantitative results with respect to the ground truth (GT) produced by different density-based clustering algorithms: CDP, the proposed method using a minimax path-cost function, and the proposed method using a trained model (Support Vector Machine). B. F1 score and Jaccard index.
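
The F1 score and Jaccard index defined in the Figure 4 caption can be computed directly from the confusion counts. The sketch below assumes the counts TP, FP and FN are already available; how they are obtained from cluster labels is not specified in the captions, and the function name `f1_and_jaccard` and the example counts are hypothetical.

```python
def f1_and_jaccard(tp, fp, fn):
    """Return (F1, Jaccard) from true positives, false positives, false negatives.

    F1      = 2 * Recall * Precision / (Recall + Precision)
    Jaccard = TP / (TP + FN + FP)
    Zero denominators yield 0.0 (a convention chosen for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * recall * precision / (recall + precision)
    else:
        f1 = 0.0
    total = tp + fn + fp
    jaccard = tp / total if total else 0.0
    return f1, jaccard


# Made-up counts for illustration only: 8 TP, 2 FP, 2 FN.
f1, jaccard = f1_and_jaccard(tp=8, fp=2, fn=2)
print(round(f1, 3), round(jaccard, 3))  # -> 0.8 0.667
```

The two scores are linked by $F1 = \frac{2J}{1+J}$, where $J$ is the Jaccard index, so they always rank methods in the same order; neither uses TN.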