
Clustering data by reordering them

Axel Descamps, Sélène Forget, Aliénor Lahlou, Claire Lavergne, Camille Berthelot, Guillaume Stirnemann, Rodolphe Vuilleumier, Nicolas Chéron

TL;DR

A new algorithm is proposed, based on the simple idea that members of a family look like each other and do not resemble elements foreign to the family. It is applied to sort biomolecule conformations, gene sequences, cells, images, and experimental conditions.

Abstract

Grouping elements into families to analyse them separately is a standard procedure in many areas of science. We propose herein a new algorithm based on the simple idea that members of a family look like each other and do not resemble elements foreign to the family. After the data are reordered according to the distance between elements, the analysis is performed automatically with easily understandable parameters. Noise is explicitly taken into account to deal with the variety of problems of a data-driven world. We applied the algorithm to sort biomolecule conformations, gene sequences, cells, images, and experimental conditions.
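
To make the reordering idea concrete, here is a minimal sketch. The abstract does not spell out the reordering rule, so the code assumes a greedy nearest-neighbour chain: starting from a chosen element, it repeatedly appends the closest element not yet placed. The function name `greedy_reorder` and the toy data are hypothetical and are not taken from the paper; the actual YACARE procedure may differ.

```python
import numpy as np

def greedy_reorder(dist, start=0):
    """Order indices by greedy nearest-neighbour chaining.

    ASSUMPTION: this rule is illustrative only; it is not claimed to be
    the reordering used by the paper's algorithm.
    """
    n = dist.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = [start]
    visited[start] = True
    for _ in range(n - 1):
        # Distances from the last placed element; mask the ones already placed.
        candidates = np.where(visited, np.inf, dist[order[-1]])
        nxt = int(np.argmin(candidates))
        order.append(nxt)
        visited[nxt] = True
    return np.array(order)

# Hypothetical toy data: two well-separated 1-D groups, interleaved on input.
points = np.array([0.0, 10.0, 0.1, 10.2, 0.3, 10.1])
dist = np.abs(points[:, None] - points[None, :])

order = greedy_reorder(dist, start=0)
# Apply the same permutation to rows and columns of the distance matrix.
reordered = dist[np.ix_(order, order)]
```

After reordering, members of the same group sit next to each other, so the permuted distance matrix shows low-distance blocks along its diagonal. This is the kind of structure that a subsequent scan along the diagonal can exploit.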

Paper Structure

This paper contains 29 sections, 1 equation, 21 figures, 2 tables.

Figures (21)

  • Figure 1: Principles of the YACARE algorithm. (A) Reordering the data. (B) Moving a stencil along the diagonal of the reordered matrix, and plot of $\Delta_d$ along the diagonal (from the toy dataset). (C) Finding the optimal cut-off. (D, E, F) Final results on the toy dataset, where automatically found clusters are displayed in the plot of $\Delta_d$, in the distance matrix and in the actual data (prior to expansion of clusters).
  • Figure 2: Comparison of methods. (A) Toy dataset with nine sets of 500 points normally distributed around their centers, and 900 randomly distributed points (20%). (B) Clustering with Ward's method, asking for nine clusters. (C) Clustering with the Gromos method, with a cut-off chosen to provide nine clusters. (D) Clustering with HDBSCAN, with a minimal cluster size corresponding to 3.0% of the data. (E) Clustering with density peaks, with $\rho_{min}$ and $\delta_{min}$ chosen according to the decision graph. (F) Clustering with the YACARE method with default parameters for merging and expansion of clusters.
  • Figure 3: Comparison between YACARE and PCA. (A-B-C) Data points from the RNA structure plotted on two PCA components, and colored according to their clusters. PCA and clustering were performed on the global structure. (D) Local analysis of the attacking angle for the chemical reaction, coming from a 250 ns simulation. The optimal angle is around 140$^{\circ}$ (pink cluster). Data were ordered by YACARE and are colored according to their clusters.
  • Figure S1: Reordering a matrix, and proposed tools to compare clusters. (A) Original and reordered distance matrices from the toy dataset. (B) The size, mean distance and standard deviation of the distance in all clusters and off-diagonal zones are displayed. The user can choose to display in each square the mean distance, the standard deviation, both, or none. (C) All clusters are identified as red squares, and off-diagonal zones as dashed-purple rectangles. The values for the zones on panel (B) match the zones on panel (C).
  • Figure S2: Comparing different starting points for reordering the distance matrix. The reordering of the distance matrix with six starting points is compared, and the integral of $\Delta_d$ is provided. First row: reordering according to the first element, to the element that provides the lowest value for the integral of $\Delta_d$, to the element that provides the highest value for the integral of $\Delta_d$. Second row: reordering according to three random elements.
  • ...and 16 more figures