A Universal Non-Parametric Approach For Improved Molecular Sequence Analysis
Sarwan Ali, Tamkanat E Ali, Prakash Chourasia, Murray Patterson
TL;DR
The paper addresses molecular sequence classification with limited data and resources by replacing deep neural networks with a compression-based, non-parametric pipeline. It combines lossless compression, Normalized Compression Distance, and kernel methods to produce a distance matrix, which is converted to a Gaussian kernel and reduced via kernel PCA to obtain informative embeddings. The approach achieves state-of-the-art or competitive performance on real-world datasets (e.g., Human DNA) without heavy parameter tuning, demonstrating improved efficiency and scalability. This has practical impact for low-resource biology and rapid analysis contexts, offering an accessible alternative to pretrained language models and large neural nets.
Abstract
Understanding the characteristics and functions of molecular sequences is essential in biological research. Neural network-based techniques are widely used for molecular sequence classification. Despite their high accuracy, these models often require a large number of parameters and large amounts of training data. In this work, we present a novel compression-based model, motivated by \cite{jiang2023low}, which combines simple compression algorithms such as Gzip and Bz2 with the Normalized Compression Distance (NCD) to achieve better classification performance without relying on handcrafted features or pre-trained models. First, we compress each molecular sequence using a well-known compression algorithm such as Gzip or Bz2. By leveraging the latent structure encoded in the compressed files, we compute the NCD, which is derived from Kolmogorov complexity, between each pair of molecular sequences. This yields a distance matrix, which we convert into a kernel matrix using a Gaussian kernel. Next, we apply kernel Principal Component Analysis (PCA) to obtain a vector representation of each molecular sequence that captures important structural and functional information. The resulting representations provide an efficient yet effective solution for molecular sequence analysis and can be used in ML-based downstream tasks. The proposed approach eliminates the need for computationally intensive Deep Neural Networks (DNNs), with their large parameter counts and data requirements, and instead relies on a lightweight and universally accessible compression-based model.
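The first stages of the pipeline described in the abstract (compression, pairwise NCD, Gaussian kernel) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy sequences, the bandwidth `sigma`, the averaging of the two concatenation orders, and the zero diagonal are all choices made here for illustration. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length.

```python
import gzip
import math


def compressed_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input."""
    return len(gzip.compress(data))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences."""
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(seqs):
    """Pairwise NCD matrix as a list of lists.

    Assumptions made for this sketch: the diagonal is set to 0, and the
    two concatenation orders are averaged so the matrix is symmetric
    (gzip-based NCD is not exactly symmetric in practice).
    """
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (ncd(seqs[i], seqs[j]) + ncd(seqs[j], seqs[i]))
            dist[i][j] = dist[j][i] = d
    return dist


def gaussian_kernel(dist, sigma=1.0):
    """Map distances to similarities: K_ij = exp(-d_ij^2 / (2 * sigma^2))."""
    return [[math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in row]
            for row in dist]


# Hypothetical toy sequences, for illustration only.
sequences = [
    "ATGCGTACGTTAGCATGCGTACGTTAGC",
    "ATGCGTACGTTAGCATGCGTACGATAGC",
    "TTTTTTTTAAAAAAAACCCCCCCCGGGG",
]
D = ncd_matrix(sequences)
K = gaussian_kernel(D, sigma=1.0)
```

The kernel matrix `K` could then be passed to a kernel PCA implementation that accepts precomputed kernels, for example scikit-learn's `KernelPCA(kernel="precomputed")`, to obtain the low-dimensional embeddings used for downstream classification. Swapping `gzip` for the standard-library `bz2` module would give the Bz2 variant mentioned in the abstract.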
