QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics Analysis

Chao Hui Huang

TL;DR

QuST-LLM addresses the interpretability challenge of spatial transcriptomics (ST) by translating high-dimensional spatial gene expression into human-readable biological narratives. It extends QuPath via QuST to provide end-to-end ST data loading, region of interest (ROI) selection, Gene Ontology (GO) enrichment analysis, and LLM-driven interpretation. The framework supports forward analysis, based on key genes and comparative expression, and backward analysis, which maps natural language descriptions to spatial regions. Both are illustrated with examples using GPT-4 and GOATOOLS. Quantitative validation reports ROC AUC values (e.g., 0.94) that indicate strong alignment between language prompts and spatial patterns, underscoring the tool's potential to enhance interpretability and accessibility in spatial biology.

Abstract

In this paper, we introduce QuST-LLM, an extension of QuPath that uses large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. QuST-LLM simplifies work with intricate, high-dimensional ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functional annotation. It also employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby improving the interpretability of ST data. Consequently, users can interact with their own ST data using natural language. QuST-LLM thus gives researchers a means to unravel the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research. QuST-LLM is part of the QuST project. The source code is hosted on GitHub, and documentation is available at https://github.com/huangch/qust.

Paper Structure (13 sections, 7 figures, 1 algorithm)

This paper contains 13 sections, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: The QuST-LLM workflow for forward analysis consists of the following steps. (a) Users begin by importing ST data into QuPath using QuST. This step may require additional spatial alignment data, which can be obtained via FIJI if the user is working with a 10x Xenium dataset (see text for more details). Once the ST data is loaded, users can perform analysis and visualization using QuPath and QuST. (b) QuST-LLM takes the objects selected by the user, including single-cell clusters or regions, performs a series of single-cell data preprocessing steps, and then obtains a list of GO terms via GO enrichment analysis (GOEA). (c) The spatial data and GO terms are integrated as biological evidence, which can be interpreted using an LLM service. The final outcome is presented to the user.
  • Figure 2: The QuST-LLM workflow for backward analysis consists of the following steps. (a) Users begin by providing a natural language description of the biological evidence they are looking for. An LLM service then interprets the input and extracts the key terms, which may be used to isolate a sub-graph of the GO. (b) QuST-LLM identifies the key genes using GOEA based on the obtained GO terms. (c) Given the ST data loaded into QuST, users can then identify the cells that may be highly relevant to the description they provided.
  • Figure 3: Two approaches for obtaining key genes.
  • Figure 4: LLM interpretation of high-ranking genes based on the selected immuno-cell clusters. (a) The provided whole slide image (WSI), with the selected single-cell clusters highlighted as yellow spots. (b) The results of GOEA: the x-axis represents the ratio of relevant genes and relevant GO terms, the y-axis lists the identified GO terms sorted by uncorrected p-values, and the heat map represents the corresponding p-value for each GO term. (c) The interpretation of the selected immuno-cell clusters as determined by the LLM.
  • Figure 5: LLM interpretation of the selected epithelial/tumor-epithelial single-cell clusters based on highly ranked gene expression. (a) The provided whole slide image (WSI), with the selected epithelial/tumor-epithelial single-cell clusters highlighted as yellow markers. (b) The results of GOEA: the x-axis represents the ratio of relevant genes and relevant GO terms, the y-axis lists the identified GO terms sorted by uncorrected p-values, and the heat map represents the corresponding p-value for each GO term. (c) The interpretation generated by the LLM.
  • ...and 2 more figures
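
The GOEA step described in Figures 1(b), 4(b), and 5(b) can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric over-representation test, which is a common choice for GO enrichment but is not confirmed here as the exact configuration used in QuST-LLM. The function names (`hypergeom_sf`, `enrich`), the gene identifiers, and the GO term labels are hypothetical toy values, and the "ratio" column is one plausible reading of the x-axis in the GOEA panels. The sketch does not call GOATOOLS; it only shows the statistic behind a ranked list of GO terms with uncorrected p-values.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """Return P(X >= k) for a hypergeometric variable.

    M: size of the background gene set
    n: number of background genes annotated with the GO term
    N: number of genes in the study (selected) set
    k: number of study genes annotated with the GO term
    """
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def enrich(study_genes, background_genes, go_to_genes):
    """Rank GO terms by uncorrected over-representation p-value (ascending)."""
    background = set(background_genes)
    study = set(study_genes) & background
    results = []
    for term, genes in go_to_genes.items():
        annotated = set(genes) & background
        hits = study & annotated
        if not hits:
            continue  # terms with no study genes are not reported
        p_value = hypergeom_sf(len(hits), len(background), len(annotated), len(study))
        # Assumed definition: fraction of the term's genes found in the study set.
        ratio = len(hits) / len(annotated)
        results.append((term, ratio, p_value))
    return sorted(results, key=lambda r: r[2])


# Toy data: 20 background genes and two hypothetical GO terms.
background = [f"g{i}" for i in range(20)]
go_to_genes = {
    "GO:toy_immune": ["g0", "g1", "g2", "g3", "g4"],
    "GO:toy_cycle": ["g10", "g11", "g12", "g13", "g14"],
}
study = ["g0", "g1", "g2", "g3", "g10"]  # e.g. key genes of a selected cluster

results = enrich(study, background, go_to_genes)
for term, ratio, p_value in results:
    print(f"{term}\tratio={ratio:.2f}\tp={p_value:.4g}")
```

In a forward analysis of this kind, the ranked terms (rather than raw expression values) would be what is handed to the LLM as biological evidence. In practice the p-values would typically also be corrected for multiple testing, which this sketch omits.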