Auxiliary Gene Learning: Spatial Gene Expression Estimation by Auxiliary Gene Selection
Kaito Shiku, Kazuya Nishimura, Shinnosuke Matsuo, Yasuhiro Kojima, Ryoma Bise
TL;DR
This work tackles noisy spatial gene expression estimation by using otherwise ignored genes as auxiliary supervision. It introduces Auxiliary Gene Learning (AGL) and a scalable Prior-Knowledge-Based Differentiable Top-$k$ Gene Selection via Bi-level Optimization (DkGSB), which ranks roughly 20,000 auxiliary candidates by their highly variable gene (HVG) scores and learns a differentiable top-$k$ mask that selects a useful subset. Across intra- and inter-batch experiments on spatial-transcriptomics datasets, AGL with DkGSB consistently improves primary target gene prediction, measured by Pearson correlation, over conventional single-task learning and standard auxiliary-task baselines. Notably, DkGSB discards a large share of the auxiliary genes (about 82% in some settings, keeping about 18%, e.g., 2,698 of 15,000) while still achieving superior performance, and it is robust to the number of primary genes. The method is model-agnostic, so it can be added to existing ST prediction models to exploit broad auxiliary information efficiently.
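As a rough illustration of the evaluation metric and the retention figure quoted above, the sketch below computes a Pearson correlation per gene across spots and averages it over genes. This per-gene averaging is a common convention in ST prediction work and is assumed here; the summary does not spell out the exact protocol. The function names `pearson` and `mean_gene_pcc` and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant gene contributes a correlation of 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def mean_gene_pcc(pred, true):
    """Average the per-gene correlation; rows are spots, columns are genes."""
    n_genes = len(true[0])
    per_gene = [
        pearson([row[g] for row in pred], [row[g] for row in true])
        for g in range(n_genes)
    ]
    return sum(per_gene) / n_genes

# Toy data: 3 spots x 2 primary genes (illustrative values only).
true = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]]
pred = [[2.0, -2.0], [4.0, -1.0], [6.0, -4.0]]
score = mean_gene_pcc(pred, true)

# Retention quoted in the summary: 2,698 of 15,000 auxiliary genes kept.
kept_fraction = 2698 / 15000
```

In the toy data the first gene is perfectly correlated and the second perfectly anti-correlated, so the mean is close to zero; `kept_fraction` comes to about 0.18, which matches the roughly 18% retention stated above.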
Abstract
Spatial transcriptomics (ST) is a novel technology that enables the observation of gene expression at the resolution of individual spots within pathological tissues. ST quantifies the expression of tens of thousands of genes in a tissue section; however, heavy observational noise is often introduced during measurement. To ensure meaningful assessment, prior studies have restricted both training and evaluation to a small subset of highly variable genes, so genes outside this subset have been excluded from training altogether. However, because genes are likely to be co-expressed, low-expression genes may still contribute to estimating the evaluation targets. In this paper, we propose Auxiliary Gene Learning (AGL), which exploits the ignored genes by reformulating their expression estimation as auxiliary tasks and training them jointly with the primary tasks. To leverage auxiliary genes effectively, we must select a subset that positively influences the prediction of the target genes. This is a challenging optimization problem because of the vast number of possible combinations. To overcome this challenge, we propose Prior-Knowledge-Based Differentiable Top-$k$ Gene Selection via Bi-level Optimization (DkGSB), a method that ranks genes using prior knowledge and relaxes the combinatorial selection problem into a differentiable top-$k$ selection problem. Experiments confirm the effectiveness of incorporating auxiliary genes and show that the proposed method outperforms conventional auxiliary task learning approaches.
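To make the selection idea concrete, here is a minimal sketch of one possible relaxation: genes are ranked by a prior score, and a sigmoid over the rank produces a soft mask that approaches a hard top-$k$ cut as the temperature shrinks. This is an illustrative assumption, not the authors' implementation. The sigmoid form, the `temperature` parameter, the normalized auxiliary weighting in `joint_loss`, and all function names are hypothetical, and the bi-level update of $k$ is omitted.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def soft_topk_mask(prior_scores, k, temperature=1.0):
    """Soft mask in [0, 1] per gene; close to 1 for the k best-ranked genes.

    Assumption: ranks are fixed by the prior score (e.g., an HVG score),
    and only the cut-off k would be learned.
    """
    order = sorted(range(len(prior_scores)),
                   key=lambda i: prior_scores[i], reverse=True)
    ranks = [0] * len(prior_scores)
    for rank, gene in enumerate(order):
        ranks[gene] = rank  # rank 0 = highest prior score
    return [sigmoid((k - 0.5 - r) / temperature) for r in ranks]

def joint_loss(primary_losses, auxiliary_losses, mask):
    """Primary loss plus the mask-weighted mean of the auxiliary losses."""
    primary = sum(primary_losses) / len(primary_losses)
    weighted = sum(m * l for m, l in zip(mask, auxiliary_losses))
    return primary + weighted / max(sum(mask), 1e-8)

# Toy example: 5 auxiliary genes with hypothetical prior scores.
prior_scores = [0.9, 0.1, 0.5, 0.7, 0.3]
mask = soft_topk_mask(prior_scores, k=2, temperature=0.1)
selected = [i for i, m in enumerate(mask) if m > 0.5]
total = joint_loss([1.0, 3.0], [2.0] * 5, mask)
```

Because the mask is a smooth function of `k`, a gradient with respect to `k` exists, which is what would let an outer optimization step adjust how many auxiliary genes are kept. A hard top-$k$ cut has no such gradient.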
