GenAR: Next-Scale Autoregressive Generation for Spatial Gene Expression Prediction
Jiarui Ouyang, Yihui Wang, Yihang Gao, Yingxue Xu, Shu Yang, Hao Chen
TL;DR
GenAR tackles the cost and interpretability barriers of spatial transcriptomics by predicting spatial gene expression from H&E images using a discrete, multi-scale autoregressive framework. It groups genes into hierarchical clusters and performs codebook-free discrete token generation to predict raw counts, conditioning decoding on fused histological and spatial embeddings via AdaLN. The method factorizes the conditional distribution as $p(\mathbf{y}|\mathbf{H})=\prod_{k=1}^{K} p(\mathbf{y}^{(k)}|\mathbf{H},\mathbf{y}^{(<k)})$ and optimizes a multi-scale loss $\mathcal{L}_{\text{total}}=\frac{1}{K}\sum_{k=1}^{K}\mathcal{L}_k$, in which the intermediate scales use KL terms and the final scale uses a Gaussian likelihood with variance $\sigma^2=\alpha\mu+\beta$. Empirically, GenAR achieves state-of-the-art performance on four spatial transcriptomics datasets, demonstrating robust cross-tissue generalization and practical potential for cost-effective molecular profiling.
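To make the loss structure concrete, the following is a minimal NumPy sketch of $\mathcal{L}_{\text{total}}$, not the released implementation. It assumes that each intermediate KL term compares a target categorical distribution with a predicted one, that the final-scale term is a Gaussian negative log-likelihood with variance $\alpha\mu+\beta$, and that $\mu \ge 0$ and $\beta > 0$ so the variance stays positive. The function names and default values of `alpha` and `beta` are hypothetical.

```python
import numpy as np

def gaussian_nll(y, mu, alpha=1.0, beta=1.0):
    """Mean Gaussian negative log-likelihood with variance alpha * mu + beta."""
    var = alpha * mu + beta  # assumed positive (mu >= 0, beta > 0)
    return float(np.mean(0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)))

def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over rows of categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def total_loss(intermediate, final, alpha=1.0, beta=1.0):
    """Average the K per-scale losses.

    intermediate: list of (target_dist, predicted_dist) pairs for scales 1..K-1
    final: (y, mu) pair of observed values and predicted means at scale K
    """
    losses = [kl_divergence(p, q) for p, q in intermediate]
    y, mu = final
    losses.append(gaussian_nll(y, mu, alpha, beta))
    return sum(losses) / len(losses)
```

Because the variance grows linearly with the predicted mean, a given absolute error is penalized less for highly expressed genes than for lowly expressed ones, which is one plausible reading of why a mean-dependent variance suits count data.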
Abstract
Spatial Transcriptomics (ST) offers spatially resolved gene expression but remains costly. Predicting expression directly from widely available Hematoxylin and Eosin (H&E) stained images presents a cost-effective alternative. However, most computational approaches (i) predict each gene independently, overlooking co-expression structure, and (ii) cast the task as continuous regression despite expression being discrete counts. This mismatch can yield biologically implausible outputs and complicate downstream analyses. We introduce GenAR, a multi-scale autoregressive framework that refines predictions from coarse to fine. GenAR clusters genes into hierarchical groups to expose cross-gene dependencies, models expression as codebook-free discrete token generation to directly predict raw counts, and conditions decoding on fused histological and spatial embeddings. From an information-theoretic perspective, the discrete formulation avoids log-induced biases and the coarse-to-fine factorization aligns with a principled conditional decomposition. Extensive experimental results on four Spatial Transcriptomics datasets across different tissue types demonstrate that GenAR achieves state-of-the-art performance, offering potential implications for precision medicine and cost-effective molecular profiling. Code is publicly available at https://github.com/oyjr/genar.
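As an illustration of the coarse-to-fine idea, the sketch below builds one target vector per scale from a spot's raw counts. It assumes that a coarse-scale target is the sum of raw counts within each gene group; the abstract does not state the aggregation rule, so this choice, the function name `coarse_targets`, and the toy group labels are all hypothetical.

```python
import numpy as np

def coarse_targets(counts, labels_per_scale):
    """Sum raw counts within each gene group, giving one target vector per scale.

    counts: 1-D array of non-negative integer counts, one entry per gene
    labels_per_scale: list of 1-D integer arrays; labels[g] is the group of gene g
    """
    targets = []
    for labels in labels_per_scale:
        labels = np.asarray(labels)
        n_groups = int(labels.max()) + 1
        summed = np.bincount(labels, weights=counts, minlength=n_groups)
        targets.append(summed.astype(np.int64))  # keep targets as discrete counts
    return targets

# Toy spot with four genes and three scales (1 group, 2 groups, 4 groups).
counts = np.array([3, 0, 5, 2])
scales = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 3]]
targets = coarse_targets(counts, scales)
```

Under this assumption every scale remains an integer count vector, so the model would predict discrete values at each step and the finest scale recovers the per-gene raw counts.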
