Disturbance-based Discretization, Differentiable IDS Channel, and an IDS-Correcting Code for DNA-based Storage

Alan J. X. Guo, Mengyi Wei, Yufan Dai, Yali Wei, Pengchen Zhang

TL;DR

THEA-code is an autoencoder-based approach for efficiently generating channel-customized IDS-correcting codes. The resulting codes perform well across complex IDS channels, particularly realistic DNA-based storage channels.

Abstract

With recent advancements in next-generation data storage, especially in biological molecule-based storage, insertion, deletion, and substitution (IDS) error-correcting codes have garnered increased attention. However, a universal method for designing tailored IDS-correcting codes across varying channel settings remains underexplored. We present an autoencoder-based approach, THEA-code, aimed at efficiently generating IDS-correcting codes for complex IDS channels. In this work, a disturbance-based discretization is proposed to discretize the features of the autoencoder, and a simulated differentiable IDS channel is developed as a differentiable alternative for IDS operations. These innovations facilitate the successful convergence of the autoencoder, producing channel-customized IDS-correcting codes that demonstrate commendable performance across complex IDS channels, particularly in realistic DNA-based storage channels.

Paper Structure (30 sections, 1 theorem, 26 equations, 12 figures, 9 tables)

Key Result

Proposition 3.1

By introducing disturbance to a non-generative autoencoder's feature logits $\bm{x}$ via $\mathrm{GS}(\bm{x})$, the autoencoder, upon non-trivial convergence, produces confident logits $\bm{x}$, resulting in one-hot-like probabilities $\bm{\pi}$.
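
The intuition behind Proposition 3.1 can be illustrated with a minimal sketch, assuming $\mathrm{GS}$ denotes the Gumbel-Softmax mentioned in the caption of Figure 5. The helper names (`gumbel_softmax`, `argmax_flip_rate`), the temperature `tau`, and the example logits are hypothetical and are not taken from the paper. The sketch only shows that Gumbel noise rarely changes the selected symbol when the logits are confident, and frequently changes it when they are nearly flat, which is the pressure that would push a converged autoencoder towards one-hot-like $\bm{\pi}$.

```python
import math
import random


def softmax(logits, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Add Gumbel(0, 1) noise to the logits, then apply softmax.

    Gumbel noise is sampled as -log(-log(U)) with U ~ Uniform(0, 1).
    """
    noisy = []
    for x in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(x - math.log(-math.log(u)))
    return softmax(noisy, tau)


def argmax_flip_rate(logits, trials=2000, seed=0):
    """Fraction of noisy samples whose argmax differs from the clean argmax."""
    rng = random.Random(seed)
    clean = max(range(len(logits)), key=logits.__getitem__)
    flips = 0
    for _ in range(trials):
        y = gumbel_softmax(logits, rng=rng)
        if max(range(len(y)), key=y.__getitem__) != clean:
            flips += 1
    return flips / trials


# Hypothetical logits over a 4-symbol alphabet.
confident = [12.0, 0.0, 0.0, 0.0]
flat = [0.2, 0.0, 0.0, 0.0]
print("flip rate, confident logits:", argmax_flip_rate(confident))
print("flip rate, flat logits:     ", argmax_flip_rate(flat))
```

Under this sketch, a decoder fed the noisy output sees an essentially stable symbol only in the confident case, so low reconstruction loss is attainable only when the encoder's logits are far from uniform.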

Figures (12)

  • Figure 1: The differentiable IDS channel. The $\hat{\bm{C}}_{\mathrm{DIDS}}$ and $\hat{\bm{C}}_{\mathrm{CIDS}}$ are generated by the differentiable and conventional IDS channels, respectively. Optimizing the difference between $\hat{\bm{C}}_{\mathrm{DIDS}}$ and $\hat{\bm{C}}_{\mathrm{CIDS}}$ trains the differentiable channel.
  • Figure 2: The flowchart of THEA-Code, including the encoder, the pretrained IDS channel, and the decoder. All of these modules are implemented using Transformer-based models. The "GS" block marks where the disturbance-based discretization is applied in the pipeline.
  • Figure 3: The accuracy of the differentiable IDS channel under various channel error rates. Accuracy is calculated by comparing the outputs of the differentiable IDS channel with those of the conventional IDS channel.
  • Figure 4: The error rates of the comparison experiments. Results for Cai, DNA-LM, HEDGES, and THEA-Code are shown across $\mathrm{Hom}$, $\mathrm{Asc}$, and $\mathrm{Des}$ channels, with respect to their code rates.
  • Figure 5: The reconstruction loss, codeword entropy, and validation NER comparing the Gumbel-Softmax setting against a vanilla softmax approach. Five runs were recorded.
  • ...and 7 more figures
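
For readers unfamiliar with the conventional IDS channel that Figures 1 and 3 use as the reference, the following is a minimal sketch of one possible simulator. The error model is an assumption made for illustration: each base independently suffers at most one insertion, deletion, or substitution with fixed probabilities. The function name `ids_channel` and its parameters are hypothetical and do not reproduce the channel definitions used in the paper.

```python
import random

ALPHABET = "ACGT"


def ids_channel(seq, p_ins=0.01, p_del=0.01, p_sub=0.01, rng=None):
    """Pass a DNA string through a simple, non-differentiable IDS channel.

    Assumed model: for each base, draw one uniform number and apply
    an insertion (random base placed before it), a deletion, a
    substitution (to a different base), or no error.
    """
    rng = rng or random.Random()
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ins:
            # Insertion: emit a random base, then the original base.
            out.append(rng.choice(ALPHABET))
            out.append(base)
        elif r < p_ins + p_del:
            # Deletion: drop the base.
            continue
        elif r < p_ins + p_del + p_sub:
            # Substitution: replace with one of the three other bases.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


codeword = "ACGTACGTACGTACGT"
received = ids_channel(codeword, p_ins=0.05, p_del=0.05, p_sub=0.05,
                       rng=random.Random(0))
print(codeword)
print(received)
```

Because the branch taken for each base depends on a discrete random draw and the output length varies, gradients cannot flow through such a simulator, which is the motivation for training a differentiable stand-in against it as described in Figure 1.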

Theorems & Definitions (2)

  • Proposition 3.1
  • Proof of Proposition 3.1 (brief)