HistoAtlas: A Pan-Cancer Morphology Atlas Linking Histomics to Molecular Programs and Clinical Outcomes

Pierre-Antoine Bannier

Abstract

We present HistoAtlas, a pan-cancer computational atlas that extracts 38 interpretable histomic features from 6,745 diagnostic H&E slides across 21 TCGA cancer types and systematically links every feature to survival, gene expression, somatic mutations, and immune subtypes. All associations are covariate-adjusted, multiple-testing corrected, and classified into evidence-strength tiers. The atlas recovers known biology, from immune infiltration and prognosis to proliferation and kinase signaling, while uncovering compartment-specific immune signals and morphological subtypes with divergent outcomes. Every result is spatially traceable to tissue compartments and individual cells, statistically calibrated, and openly queryable. HistoAtlas enables systematic, large-scale biomarker discovery from routine H&E without specialized staining or sequencing. Data and an interactive web atlas are freely available at https://histoatlas.com .
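
The figure captions specify the multiple-testing correction mentioned above as Benjamini--Hochberg (BH), applied within each cancer type. As a rough illustration of what a BH-adjusted $P$-value is, the sketch below implements the standard step-up adjustment in plain Python. The function name `bh_adjust` and the example $P$-values are hypothetical, chosen only for illustration; this is not the atlas's own code, which is not shown in the source.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Standard step-up rule: with m tests and ascending rank k, the adjusted
    value is min over all ranks j >= k of p_(j) * m / j, capped at 1.
    """
    m = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = m - offset  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five feature-outcome tests in one cancer type.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = bh_adjust(raw)
significant = [p < 0.05 for p in adj]
print(adj)
print(significant)
```

In this toy input, the two tests with raw $P$-values just under 0.05 no longer pass the 0.05 threshold after adjustment, which is the behavior a within-cancer-type BH correction is meant to produce.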

Paper Structure (32 sections, 1 equation, 6 figures, 10 tables)

Figures (6)

  • Figure 1: The HistoAtlas pipeline and pan-cancer morphological landscape. (a) Overview of the computational pipeline. Diagnostic H&E-stained whole-slide images (6,745 slides, 21 TCGA cancer types) are segmented into tissue compartments (tumor core, stroma, invasive front), followed by cell-level detection and classification of 9 cell types. From each slide, 38 quantitative histomic features are extracted spanning tissue composition, cell densities, nuclear morphology, spatial immune topology, microenvironment heterogeneity, and cell-type ratios. (b) Pairwise Spearman correlation matrix of the 38 features computed across all 6,745 slides. Ward-linkage hierarchical clustering reveals structured modules: density features form a tight positive-correlation block, morphology features cluster together, and cross-module anti-correlations delineate distinct biological axes. Left color bar indicates feature category. Diagonal entries are masked. (c) UMAP embedding of all 6,745 slides colored by cancer type. Cancer types with distinct morphological programs (e.g., THYM, THCA) occupy separated regions, while adenocarcinomas (BRCA, LUAD, STAD) partially overlap. Gray contour lines indicate point density. (d) Cancer type composition of each L1 morphological cluster ($K = 10$, horizontal stacked bars), with cluster sizes indicated at left. Cancer types constituting more than 10% of a cluster are labeled within the bar. Right annotation shows overall survival direction per cluster (green arrow: significantly protective, HR $< 1$; red arrow: significantly adverse, HR $> 1$; gray dash: non-significant). (e) Heatmap of z-scored mean feature values per cluster, with Ward-linkage hierarchical clustering applied to both features (rows) and clusters (columns). Feature labels are colored by category. Red indicates elevated values; blue indicates suppressed values relative to the pan-cancer mean. Values are clipped at $z = \pm 2$ for visualization.
$N = 6745$ slides from 21 TCGA cancer types.
  • Figure 2: Spatial immune topology reveals compartment-specific prognostic effects. (a) Forest plot of hazard ratios (overall survival, covariate-adjusted Cox regression [age, sex, stage; stratified by TSS]) for intratumoral lymphocyte density (blue circles) and stromal lymphocyte density (orange diamonds) across cancer types and the pan-cancer cohort ($N = 4560$). Filled markers indicate moderate or strong evidence (BH-adjusted $P < 0.05$ with adequate power); hollow markers indicate suggestive or insufficient evidence. Intratumoral lymphocyte density is protective (HR $= 0.87$ $[0.81, 0.93]$, $P_{\mathrm{adj}} = 9.8 \times 10^{-4}$); stromal lymphocyte density shows a weaker protective effect (HR $= 0.89$ $[0.83, 0.97]$, $P_{\mathrm{adj}} = 0.031$). Error bars represent 95% confidence intervals. Vertical dashed line indicates HR $= 1.0$ (null). (b) Kaplan--Meier curves for intratumoral lymphocyte density in BRCA (median split, $N = 960$; High: 480, Low: 480), showing a protective association (HR $= 0.72$ $[0.60, 0.88]$, $P_{\mathrm{adj}} = 0.018$). Shaded areas indicate 95% confidence intervals. Number at risk shown below. (c) Tumor--lymphocyte nearest-neighbor distance at the invasive front inversely correlates with CD8A expression in BRCA (Spearman $\rho = -0.53$, $P_{\mathrm{adj}} = 1.8 \times 10^{-68}$, $N = 958$), demonstrating that spatial immune exclusion detected by histomics corresponds to reduced cytotoxic T-cell gene expression. Per-slide feature values averaged per case; both axes z-scored within BRCA. (d) Top gene correlates of intratumoral lymphocyte density in BRCA ($N = 953$, adjusted model). Horizontal bar chart showing the top 10 positive and top 5 negative Spearman correlations among significantly associated genes (BH-adjusted $P < 0.05$). Immune checkpoint genes (TIGIT, PDCD1, CTLA4) and T-cell markers (CD3E, CD3D, CD8A, CD8B) dominate the positive correlates, validating the biological identity of the histomic feature.
Error bars represent 95% bootstrap confidence intervals. All $P$-values were calculated using Cox proportional hazards regression (a, b) or Spearman correlation with analytical $t$-test (c, d), with Benjamini--Hochberg correction for multiple testing within each cancer type.
  • Figure 3: Morphological features recapitulate molecular programs. (a) Heatmap of mean Spearman correlation (across 21 cancer types) between 38 histomic features and 50 Hallmark pathway scores (unadjusted model). Rows and columns are hierarchically clustered (Ward linkage). Left color bar indicates pathway category (Immune, Proliferation, Signaling, Metabolic, Other); top color bar indicates histomic feature category (Composition, Density, Morphology, Spatial, Heterogeneity, Ratios). Structured correspondence is evident: immune cell density features cluster with immune pathway signatures; nuclear morphology and mitotic features cluster with cell cycle and proliferation pathways; invasion depth aligns with EMT. Colormap: RdBu_r, clipped at $\rho = \pm 0.3$. (b) Effect-size distributions for pan-cancer adjusted-model associations, stratified by molecular data type. Each histogram shows the distribution of Spearman $\rho$ values; colored bars indicate significance at FDR $< 0.05$, gray bars indicate non-significant associations. Vertical dashed lines at $\rho = \pm 0.3$. Gene expression: 4,371/5,453 significant (80%); Hallmark pathways: 1,692/2,050 (83%); copy-number variation: 2,845/5,453 (52%). The higher significance rate among pathway and expression associations, and the broader $\rho$ distributions, reflect stronger morphology--transcriptomic coupling than morphology--genomic coupling. All correlations are Spearman with analytical $P$-values ($t$-distribution approximation) and Benjamini--Hochberg correction.
  • Figure 4: Morphological clusters map to distinct molecular archetypes. (a) Immune subtype enrichment per morphological cluster (L1, pan-cancer). Heatmap of $\log_2(\mathrm{OR})$ from Fisher's exact tests comparing the proportion of each Thorsson immune subtype within each cluster to the remaining cohort. Rows: 10 morphological clusters (labeled with cluster name and dominant cancer type). Columns: five Thorsson immune subtypes (C1 Wound Healing, C2 IFN-$\gamma$ Dominant, C3 Inflammatory, C4 Lymphocyte Depleted, C6 TGF-$\beta$ Dominant). Color scale: red--blue diverging, centered at 0. Black dots indicate BH-adjusted $P < 0.05$. Cluster 6 (CRC-enriched) is dominated by C1 Wound Healing (OR $= 5.59$, $P_{\mathrm{adj}} < 10^{-88}$); Cluster 2 shows combined C4 Lymphocyte Depleted (OR $= 7.14$) and C3 Inflammatory (OR $= 4.99$) enrichment (85% combined); Cluster 8 (hormone-driven) is enriched for C3 Inflammatory. (b) Hallmark pathway enrichment (Cliff's $\delta$ from Mann--Whitney $U$ tests) per morphological cluster. Rows: 50 Hallmark pathways, hierarchically clustered (Ward linkage). Columns: 10 morphological clusters. Black dots indicate BH-adjusted $P < 0.05$. Cluster 4 (THYM-enriched) shows strong immune rejection pathway enrichment; Cluster 8 shows estrogen response enrichment ($\delta = 0.52$) with suppressed proliferation ($\delta = -0.51$); Cluster 6 shows Wnt/$\beta$-catenin enrichment ($\delta = 0.46$) consistent with CRC composition. Colormap centered at 0, range $[-0.5, 0.5]$.
  • Figure 5: Statistical framework and quality control. (a) PVCA variance decomposition per cancer type showing proportions of variance attributable to batch effects (tissue source site, red) and residual signal (gray). Pan-cancer analysis attributes 44.7% of variance to batch (TSS), 32.7% to biological signal (cancer type), and 22.6% to residual. Within individual cancer types, batch effects account for a median of 20.6% of variance. All per-cancer silhouette scores by TSS are negative (range $-0.18$ to $-0.0004$), confirming minimal batch-driven clustering. (b) Minimum detectable effect size (MDES) for harmful hazard ratios across 21 cancer types, ordered by sample size (ascending from bottom). Box plots show the distribution of MDES across features within each cancer type. Well-powered cancer types (BRCA, $N = 960$) can detect HR $\geq 1.62$; underpowered types (CHOL, $N = 36$) require HR $\geq 3.75$ for 80% power. Dashed line indicates the clinically meaningful threshold (HR $= 1.5$). $N$ values per panel: (a) 6,745 slides, 21 cancer types; (b) 5,623 survival associations across 22 cohorts.
  • ...and 1 more figure
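
The Figure 4b caption reports pathway enrichment as Cliff's $\delta$ derived from Mann--Whitney $U$ tests. The two are linked by the identity $\delta = 2U / (n_1 n_2) - 1$, where $U$ counts the pairs in which a value from the first group exceeds a value from the second, with ties counted as one half. The sketch below illustrates that identity in plain Python. The function names and the toy pathway scores are hypothetical and are not taken from the atlas.

```python
def mann_whitney_u(x, y):
    """U statistic for x versus y: number of (x_i, y_j) pairs with
    x_i > y_j, counting ties as 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def cliffs_delta(x, y):
    """Cliff's delta in [-1, 1], computed from U as 2U / (n1 * n2) - 1."""
    return 2.0 * mann_whitney_u(x, y) / (len(x) * len(y)) - 1.0


# Hypothetical pathway scores: slides in one cluster versus all other slides.
in_cluster = [0.9, 1.2, 1.5, 2.0]
rest = [0.1, 0.5, 1.0, 1.3, 0.7]

u_stat = mann_whitney_u(in_cluster, rest)
delta = cliffs_delta(in_cluster, rest)
print(u_stat, delta)
```

A positive $\delta$ means scores inside the cluster tend to exceed those outside it; swapping the two groups flips the sign. The quadratic pairwise loop is chosen for readability and would be replaced by a rank-based computation at atlas scale.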