Auxiliary Gene Learning: Spatial Gene Expression Estimation by Auxiliary Gene Selection

Kaito Shiku, Kazuya Nishimura, Shinnosuke Matsuo, Yasuhiro Kojima, Ryoma Bise

TL;DR

This work tackles the challenge of noisy spatial gene expression estimation by leveraging otherwise ignored genes as auxiliary supervision. It introduces Auxiliary Gene Learning (AGL) and a scalable Prior-Knowledge-Based Differentiable Top-$k$ Gene Selection via Bi-level Optimization (DkGSB) that ranks ~20,000 auxiliary candidates with HVG scores and learns a differentiable top-$k$ mask to select a productive subset. Across intra- and inter-batch experiments on spatial-transcriptomics datasets, AGL with DkGSB consistently improves primary target gene prediction (Pearson correlation) over conventional single-task learning and standard auxiliary-task baselines. Notably, DkGSB discards a large portion of auxiliaries (e.g., ~82% in some settings, keeping ~18%, such as 2,698 of 15,000) while achieving superior performance and demonstrating robustness to varying numbers of primary genes. The method is model-agnostic and has practical implications for enhancing ST-based predictions by efficiently harnessing broad auxiliary information.

Abstract

Spatial transcriptomics (ST) is a novel technology that enables the observation of gene expression at the resolution of individual spots within pathological tissues. ST quantifies the expression of tens of thousands of genes in a tissue section; however, heavy observational noise is often introduced during measurement. In prior studies, to ensure meaningful assessment, both training and evaluation have been restricted to a small subset of highly variable genes, and genes outside this subset have been excluded from training altogether. However, because genes are likely to be co-expressed, low-expression genes may still contribute to estimating the evaluation targets. In this paper, we propose Auxiliary Gene Learning (AGL), which exploits the ignored genes by reformulating their expression estimation as auxiliary tasks and training them jointly with the primary tasks. To leverage auxiliary genes effectively, we must select a subset of auxiliary genes that positively influences the prediction of the target genes. This is a challenging optimization problem because of the vast number of possible combinations. To overcome this challenge, we propose Prior-Knowledge-Based Differentiable Top-$k$ Gene Selection via Bi-level Optimization (DkGSB), a method that ranks genes by leveraging prior knowledge and relaxes the combinatorial selection problem into a differentiable top-$k$ selection problem. The experiments confirm the effectiveness of incorporating auxiliary genes and show that the proposed method outperforms conventional auxiliary task learning approaches.
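The ranking-plus-relaxation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the sigmoid relaxation, the temperature `tau`, the use of plain variance as the ranking score, and all function names below are assumptions made for illustration. The bi-level part (updating `k` on validation loss while the network weights are trained on the joint loss) is omitted.

```python
import math
from statistics import pvariance


def rank_by_variance(expr):
    """Return gene names sorted by descending expression variance.

    `expr` maps a gene name to its expression values across spots.
    Plain variance stands in for the HVG score here (an assumption).
    """
    return sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)


def _sigmoid(x):
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_topk_mask(n_genes, k, tau=1.0):
    """Soft weights over ranks 0..n_genes-1 for a real-valued cut-off k.

    Ranks well below k get weights near 1, ranks well above k get weights
    near 0, and the weights vary smoothly with k (hypothetical relaxation).
    """
    return [_sigmoid((k - r - 0.5) / tau) for r in range(n_genes)]


def masked_aux_loss(aux_losses_ranked, k, tau=1.0):
    """Weight per-gene auxiliary losses (already in rank order) by the mask."""
    mask = soft_topk_mask(len(aux_losses_ranked), k, tau)
    return sum(m * l for m, l in zip(mask, aux_losses_ranked))
```

Because the selection is governed by a single scalar `k` instead of one binary choice per gene, the search space collapses from all subsets to one cut-off along a fixed ranking, which is what makes gradient-based selection over roughly 20,000 candidates tractable in this kind of scheme.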

Paper Structure

This paper contains 17 sections, 6 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) Conventional gene expression estimation focuses solely on predicting primary genes, typically ignoring the remaining ones. In this study, we treat these overlooked genes as auxiliary genes. (b) Effectiveness of Auxiliary Gene Learning (AGL). PGL denotes primary gene learning, which uses only the target genes for training. AGL denotes our auxiliary gene learning, which jointly estimates primary genes and previously ignored auxiliary genes, selecting auxiliaries via a differentiable cut-off. (c) Illustration of our top-$k$ gene selection approach. As the number of possible subsets exceeds $10^{6000}$, we relax this combinatorial selection into a top-$k$ problem by leveraging prior knowledge of gene-expression signal quality.
  • Figure 2: Overview of the proposed DkGSB. The procedure has three steps: (i) auxiliary genes are ranked by a variance-based score $\bm{s}$; (ii) a single learnable scalar $k$ defines a soft top-$k$ mask $\bm{\lambda}(k)$, obtained through a differentiable relaxation of the hard cut-off; (iii) $k$ is optimized together with the network weights by a bi-level scheme.
  • Figure 3: Reasonability of HVG score-based selection. Performance of primary gene expression estimation using HVG score–based selection (blue) and random selection (orange). The vertical axis shows the performance difference between models trained with auxiliary genes selected by each method and those trained using only primary genes ("PGL"), while the horizontal axis indicates the number of selected auxiliary genes. The experiments were conducted using BOWEL B.
  • Figure 4: Behavior during cut-off $k$ optimization. The left panel shows the change in validation performance during the outer-loop optimization, while the right panel shows the change in the cut-off $k$. The experiments were conducted using the HEART dataset.
  • Figure 5: Visualization of the expression levels for the selected auxiliary genes. The top row shows the expression patterns of genes that were selected by the proposed method but not by "AGL+AMAL"; the bottom row shows the opposite: genes that were selected by "AGL+AMAL" but not by the proposed method. The name of the visualized gene is shown at the top of each slide.
  • ...and 1 more figure
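
The subset count quoted in the Figure 1 caption and the retention rate quoted in the TL;DR can be sanity-checked with a few lines of arithmetic. The sketch below assumes the roughly 20,000 auxiliary candidates mentioned in the TL;DR, so that the number of possible subsets is $2^{20000}$; the function names are hypothetical.

```python
import math


def log10_num_subsets(n_candidates):
    """Base-10 logarithm of 2**n_candidates, the number of possible subsets."""
    return n_candidates * math.log10(2)


def kept_fraction(n_kept, n_total):
    """Fraction of auxiliary genes retained after selection."""
    return n_kept / n_total


# With ~20,000 candidates, 2**20000 is about 10**6020.6, i.e. above 10**6000.
print(round(log10_num_subsets(20000), 1))
# Keeping 2,698 of 15,000 auxiliaries is about 18%, so about 82% are discarded.
print(round(100 * kept_fraction(2698, 15000)))
```

By contrast, a top-$k$ selection along a fixed ranking of 20,000 genes has only 20,001 possible hard cut-offs (including the empty selection), which is the reduction the relaxation relies on.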