Boosting t-SNE Efficiency for Sequencing Data: Insights from Kernel Selection
Avais Jan, Prakash Chourasia, Sarwan Ali, Murray Patterson
TL;DR
The paper investigates kernel selection for t-SNE applied to high-dimensional biological sequences, evaluating nine kernels across three embedding schemes (One-Hot Encoding, Spike2Vec, and minimizers) using both subjective visualization and objective neighborhood-preservation metrics. It finds that the cosine similarity kernel generally outperforms the Gaussian and isolation kernels in runtime efficiency and neighborhood preservation, and that it remains robust across the datasets and embedding methods tested. Beyond visualization, the work shows that kernel choice significantly affects downstream classification and clustering, which gives practical guidance for large-scale sequence analysis pipelines. These insights offer a data-dependent perspective on kernel design for t-SNE in genomics, supporting more reliable exploratory data analysis and downstream ML tasks.
Abstract
Dimensionality reduction techniques are essential for visualizing and analyzing high-dimensional biological sequencing data. t-distributed Stochastic Neighbor Embedding (t-SNE) is widely used for this purpose and traditionally employs the Gaussian kernel to compute pairwise similarities. However, the Gaussian kernel is not data-dependent and is computationally expensive, which limits its scalability and effectiveness for categorical biological sequences. Recent work proposed the isolation kernel as an alternative, yet it may not optimally capture sequence similarities. In this study, we comprehensively evaluate nine different kernel functions for t-SNE applied to molecular sequences, using three embedding methods: One-Hot Encoding, Spike2Vec, and minimizers. Through both subjective visualization and objective metrics (including neighborhood preservation scores), we demonstrate that the cosine similarity kernel generally outperforms the other kernels, including the Gaussian and isolation kernels, achieving superior runtime efficiency and better preservation of pairwise distances in low-dimensional space. We further validate our findings through extensive classification and clustering experiments across six diverse biological datasets (Spike7k, Host, ShortRead, Rabies, Genome, and Breast Cancer), employing multiple machine learning algorithms and evaluation metrics. Our results show that kernel selection significantly impacts not only visualization quality but also downstream analytical tasks. The cosine similarity kernel provides the most robust performance across different data types and embedding strategies, making it particularly suitable for large-scale biological sequence analysis.
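To make the idea of swapping the kernel concrete, the following is a minimal sketch of how a cosine similarity kernel over one-hot encoded sequences could stand in for the Gaussian kernel when building t-SNE's high-dimensional affinities. It is an illustration under stated assumptions, not the authors' implementation: the nucleotide alphabet, the toy sequences, the helper names (`one_hot`, `cosine_kernel`, `joint_affinities`), and the row-normalize-then-symmetrize step (borrowed from standard t-SNE) are all assumptions introduced here.

```python
import math

ALPHABET = "ACGT"  # assumed nucleotide alphabet, for illustration only


def one_hot(seq, alphabet=ALPHABET):
    """Flatten a sequence into a binary vector: one slot per (position, symbol)."""
    vec = []
    for ch in seq:
        vec.extend(1.0 if ch == a else 0.0 for a in alphabet)
    return vec


def cosine_kernel(X):
    """Pairwise cosine similarity matrix for a list of equal-length vectors."""
    norms = [math.sqrt(sum(v * v for v in x)) for x in X]
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(X[i], X[j]))
            denom = norms[i] * norms[j]
            K[i][j] = dot / denom if denom > 0 else 0.0
    return K


def joint_affinities(K):
    """Turn a kernel matrix into symmetric t-SNE-style affinities P.

    Assumption: each row is normalized into conditional probabilities
    (self-similarity excluded) and then symmetrized, as in standard t-SNE.
    """
    n = len(K)
    cond = []
    for i in range(n):
        row = [K[i][j] if j != i else 0.0 for j in range(n)]
        total = sum(row)
        cond.append([v / total if total > 0 else 0.0 for v in row])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


# Hypothetical toy sequences of equal length.
seqs = ["ACGT", "ACGA", "TTGA"]
X = [one_hot(s) for s in seqs]
K = cosine_kernel(X)
P = joint_affinities(K)
```

For one-hot vectors of equal-length sequences, the cosine similarity reduces to the fraction of positions at which two sequences agree, so no bandwidth parameter has to be tuned per point as with the Gaussian kernel. The resulting `P` sums to one and could be passed to any t-SNE optimizer that accepts precomputed affinities.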
