
ADRS-CNet: An adaptive dimensionality reduction selection and classification network for DNA storage clustering algorithms

Bowen Liu, Jiankun Li

TL;DR

A multilayer perceptron is trained to classify input DNA sequence features and adaptively select the most suitable dimensionality reduction method before clustering. Experiments show that this approach effectively mitigates the impact of the curse of dimensionality on clustering models.

Abstract

DNA storage technology offers new possibilities for massive data storage because of its high storage density, long-term preservation, low maintenance cost, and compact size. Improving the reliability of the stored information requires dealing with base errors and missing storage sequences. Currently, sequenced reads are clustered and compared to recover as much of the original sequence information as possible. However, extracting features from DNA sequences of different lengths leads to the curse of dimensionality, which must be overcome. Techniques such as PCA, UMAP, and t-SNE are commonly used to project the high-dimensional features into a low-dimensional space, but their effectiveness varies from one dataset to another. This paper therefore proposes training a multilayer perceptron model to classify input DNA sequence features and adaptively select the most suitable dimensionality reduction method, in order to enhance subsequent clustering results. In tests on open-source datasets and comparisons with various baseline methods, our model exhibits superior classification performance and significantly improves clustering outcomes. This shows that our approach effectively mitigates the impact of the curse of dimensionality on clustering models.
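
To make the dimensionality problem concrete, the following is a minimal sketch of one common way to turn a DNA read into a fixed-length feature vector: counting overlapping k-mers. The abstract does not say which feature extraction the paper uses, so k-mer counting, the function name `kmer_features`, and the example read are all assumptions made for illustration. The point is only that the vector length grows as 4^k while a single read fills very few of those entries.

```python
from collections import Counter
from itertools import product


def kmer_features(seq: str, k: int) -> list:
    """Count every overlapping k-mer of `seq` over the alphabet ACGT.

    Hypothetical helper, not taken from the paper. The returned vector
    always has 4**k entries, one per possible k-mer, in lexicographic order.
    """
    # Slide a window of width k over the read and tally each substring.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate the full k-mer vocabulary so every read maps to the same axes.
    vocabulary = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] for kmer in vocabulary]


if __name__ == "__main__":
    read = "ACGTACGTTGCA"  # made-up read, for illustration only
    for k in (2, 4, 6):
        vec = kmer_features(read, k)
        nonzero = sum(1 for c in vec if c)
        print(f"k={k}: {len(vec)} dimensions, {nonzero} non-zero")
```

Vectors like these are the kind of high-dimensional, sparse input that a method such as PCA, UMAP, or t-SNE would then project into a low-dimensional space before clustering.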

Paper Structure

This paper contains 24 sections, 11 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Major Processes of DNA Storage
  • Figure 2: The framework for ADRS-CNet
  • Figure 3: 100 to 199 Clustering accuracy with different dimensions
  • Figure 4: 100 to 199 Clustering accuracy with categorical axis
  • Figure 5: 9800 to 9899 Clustering accuracy with different dimensions
  • ...and 3 more figures