Towards LLM-Powered Task-Aware Retrieval of Scientific Workflows for Galaxy

Shamse Tasnim Cynthia, Banani Roy

TL;DR

This work tackles the limited semantic reach of keyword-based Galaxy workflow retrieval by introducing WorkflowExplorer, a two-stage system that combines dense embeddings for candidate retrieval with LLM-based reranking to capture task intent. It builds and evaluates task-aware retrieval in Galaxy using a benchmark of semantically labeled workflows and LLM-generated, task-driven queries, demonstrating superior top-k accuracy and relevance over lexical methods. The study also demonstrates feasibility through a Galaxy-integrated prototype and analyzes latency, cost, and practical deployment considerations. The findings highlight the importance of semantic representations and instruction-tuned LLMs for improving discoverability and reuse of scientific workflows, particularly for novices and interdisciplinary researchers.

Abstract

Scientific Workflow Management Systems (SWfMSs) such as Galaxy have become essential infrastructure in bioinformatics, supporting the design, execution, and sharing of complex multi-step analyses. Despite hosting hundreds of reusable workflows across domains, Galaxy's current keyword-based retrieval system offers limited support for semantic query interpretation and often fails to surface relevant workflows when exact term matches are absent. To address this gap, we propose a task-aware, two-stage retrieval framework that integrates dense vector search with large language model (LLM)-based reranking. Our system first retrieves candidate workflows using state-of-the-art embedding models and then reranks them using instruction-tuned generative LLMs (GPT-4o, Mistral-7B) based on semantic task alignment. To support robust evaluation, we construct a benchmark dataset of Galaxy workflows annotated with semantic topics via BERTopic and synthesize realistic task-oriented queries using LLMs. We conduct a comprehensive comparison of lexical, dense, and reranking models using standard IR metrics, presenting the first systematic evaluation of retrieval performance in the Galaxy ecosystem. Results show that our approach significantly improves top-k accuracy and relevance, particularly for long or under-specified queries. We further integrate our system as a prototype tool within Galaxy, providing a proof-of-concept for LLM-enhanced workflow search. This work advances the usability and accessibility of scientific workflows, especially for novice users and interdisciplinary researchers.
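
The two-stage flow described in the abstract (retrieve candidates by vector similarity, then rerank them by task alignment) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `score_fn` is a placeholder for an LLM-based relevance judgment, and every function name and workflow ID here is hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a dense embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, workflows: Dict[str, str], k: int) -> List[str]:
    """Stage 1: return the IDs of the k workflows most similar to the query."""
    q = embed(query)
    ranked = sorted(
        workflows,
        key=lambda wid: cosine(q, embed(workflows[wid])),
        reverse=True,
    )
    return ranked[:k]


def rerank(
    query: str,
    candidates: List[str],
    workflows: Dict[str, str],
    score_fn: Callable[[str, str], float],
) -> List[str]:
    """Stage 2: reorder candidates by a relevance score.

    score_fn(query, description) is a placeholder for an LLM call that
    rates how well a workflow matches the task expressed in the query.
    """
    return sorted(
        candidates,
        key=lambda wid: score_fn(query, workflows[wid]),
        reverse=True,
    )


def two_stage_search(
    query: str,
    workflows: Dict[str, str],
    score_fn: Callable[[str, str], float],
    k: int = 3,
) -> List[str]:
    """Run candidate retrieval followed by reranking."""
    return rerank(query, retrieve(query, workflows, k), workflows, score_fn)
```

In this sketch the reranker only ever sees the k candidates from the first stage, so the number of (potentially costly) LLM scoring calls per query is bounded by k, at the price of missing any relevant workflow that the first stage fails to surface.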

Paper Structure

This paper contains 27 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of our proposed framework
  • Figure 2: Schematic diagram of the methodology
  • Figure 3: WorkflowExplorer tool interface in Galaxy: users provide a high-level natural language query.
  • Figure 4: Galaxy job confirmation view with a hyperlink to the HTML results page.
  • Figure 5: HTML output showing top-matched workflows, descriptions, and direct .ga file downloads.