GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman

TL;DR

Constructing a precise GDC cohort means choosing among hundreds of filter properties. The paper introduces GDC Cohort Copilot, an open-source system that translates a natural-language cohort description into structured GDC cohort filter JSON and presents it in an interactive, exportable Gradio interface. At its core is GDC Cohort LLM, trained on real user-made filters and large-scale synthetic filters, each paired with a natural-language description; in the authors' evaluations it outperforms GPT-4o. The system combines a core schema of 68 filter properties with a containerized web app deployed on HuggingFace Spaces, so researchers can curate a cohort in natural language and refine it before exporting to the GDC. Mixing in synthetic data significantly improves performance, and the final open-source model scores higher than the GPT-4o baseline on TPR, IoU, Exact, and BERTScore. The result is an accessible, locally served tool that streamlines cancer genomics cohort curation.
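
To make "structured GDC cohort filter JSON" concrete, here is a minimal sketch of what such a filter might look like. It assumes the copilot's output follows the nested op/content form used by the public GDC API; the helper names `in_clause` and `and_filter` and the example description are hypothetical and do not come from the paper.

```python
import json


def in_clause(field, values):
    """One leaf clause: the case's `field` must take one of `values`."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def and_filter(*clauses):
    """Combine leaf clauses so that a case must satisfy all of them."""
    return {"op": "and", "content": list(clauses)}


# Hypothetical description: "female patients in the TCGA-BRCA project".
# The field names follow the GDC API convention; which fields the copilot
# actually emits is an assumption here.
cohort_filter = and_filter(
    in_clause("cases.project.project_id", ["TCGA-BRCA"]),
    in_clause("cases.demographic.gender", ["female"]),
)

print(json.dumps(cohort_filter, indent=2))
```

A model that emits this structure has to pick both the right field and the right value for every clause, which is why locating descriptors among hundreds of properties is the hard part of the task.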

Abstract

The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. We implement and share GDC Cohort Copilot as a containerized Gradio app on HuggingFace Spaces, available at https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds. All source code is available at https://github.com/uc-cdis/gdc-cohort-copilot.
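
The comparison against GPT-4o is summarized with TPR, IoU, and Exact match, among other metrics. The sketch below shows one plausible way to compute these three, assuming each filter is flattened into a set of (field, value) pairs and the generated set is compared with a reference set. The paper's exact definitions may differ; `filter_pairs` and `score` are hypothetical names.

```python
def filter_pairs(cohort_filter):
    """Flatten a nested op/content filter into a set of (field, value) pairs."""
    content = cohort_filter["content"]
    if isinstance(content, list):  # boolean node such as "and"
        pairs = set()
        for child in content:
            pairs |= filter_pairs(child)
        return pairs
    values = content["value"]
    if not isinstance(values, list):  # e.g. a single threshold for ">="
        values = [values]
    return {(content["field"], value) for value in values}


def score(predicted, reference):
    """Compare a generated filter with a reference filter (assumed definitions)."""
    pred, ref = filter_pairs(predicted), filter_pairs(reference)
    overlap = len(pred & ref)
    union = len(pred | ref)
    return {
        "tpr": overlap / len(ref) if ref else 0.0,  # share of reference pairs recovered
        "iou": overlap / union if union else 0.0,   # intersection over union
        "exact": pred == ref,                       # all pairs match, none extra
    }


def leaf(field, value):
    return {"op": "in", "content": {"field": field, "value": [value]}}


reference = {"op": "and", "content": [
    leaf("cases.project.project_id", "TCGA-BRCA"),
    leaf("cases.demographic.gender", "female"),
]}
predicted = {"op": "and", "content": [
    leaf("cases.project.project_id", "TCGA-BRCA"),
    leaf("cases.demographic.gender", "male"),
]}

print(score(predicted, reference))
```

In this toy case the prediction recovers one of the two reference pairs and adds one wrong pair, so TPR is 0.5, IoU is 1/3, and Exact is false. IoU penalizes spurious clauses that TPR alone would ignore.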

Paper Structure

This paper contains 18 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Overview of GDC Cohort Copilot implementation and user workflow. (A) Implementation of GDC Cohort Copilot involves training the GDC Cohort LLM to translate a natural language query describing a cohort into the cohort filter JSON. The cohort JSONs are derived from datasets of real user-made cohorts or synthetically generated cohorts. The paired natural language queries are generated from the cohort JSONs by a frozen LLM. The final trained GDC Cohort LLM is served in a containerized web app that exposes a GDC Cohort Builder-like interface running on HuggingFace Spaces. (B) A user curates a cohort with GDC Cohort Copilot by (1) entering a natural language description of the desired cohort, (2) which is automatically passed to GDC Cohort LLM, served using Guidance inside a Gradio app. (3) The generated cohort filter is automatically populated back into the interface, where the user can manually refine the cohort before (4) exporting it to the NCI GDC.