Towards LLM-Powered Task-Aware Retrieval of Scientific Workflows for Galaxy
Shamse Tasnim Cynthia, Banani Roy
TL;DR
This work tackles the limited semantic reach of keyword-based Galaxy workflow retrieval by introducing WorkflowExplorer, a two-stage system that combines dense embeddings for candidate retrieval with LLM-based reranking to capture task intent. It builds and evaluates task-aware retrieval in Galaxy using a benchmark of semantically labeled workflows and LLM-generated, task-driven queries, demonstrating superior top-k accuracy and relevance over lexical methods. The study also demonstrates feasibility through a Galaxy-integrated prototype and analyzes latency, cost, and practical deployment considerations. The findings highlight the importance of semantic representations and instruction-tuned LLMs for improving discoverability and reuse of scientific workflows, particularly for novices and interdisciplinary researchers.
Abstract
Scientific Workflow Management Systems (SWfMSs) such as Galaxy have become essential infrastructure in bioinformatics, supporting the design, execution, and sharing of complex multi-step analyses. Despite hosting hundreds of reusable workflows across domains, Galaxy's current keyword-based retrieval system offers limited support for semantic query interpretation and often fails to surface relevant workflows when exact term matches are absent. To address this gap, we propose a task-aware, two-stage retrieval framework that integrates dense vector search with large language model (LLM)-based reranking. Our system first retrieves candidate workflows using state-of-the-art embedding models and then reranks them using instruction-tuned generative LLMs (GPT-4o, Mistral-7B) based on semantic task alignment. To support robust evaluation, we construct a benchmark dataset of Galaxy workflows annotated with semantic topics via BERTopic and synthesize realistic task-oriented queries using LLMs. We conduct a comprehensive comparison of lexical, dense, and reranking models using standard IR metrics, presenting the first systematic evaluation of retrieval performance in the Galaxy ecosystem. Results show that our approach significantly improves top-k accuracy and relevance, particularly for long or under-specified queries. We further integrate our system as a prototype tool within Galaxy, providing a proof-of-concept for LLM-enhanced workflow search. This work advances the usability and accessibility of scientific workflows, especially for novice users and interdisciplinary researchers.
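As an illustration of the retrieve-then-rerank pattern described above, the following is a minimal sketch and not the authors' implementation. It makes several assumptions: the workflow descriptions are invented for this example, `embed` is a bag-of-words stand-in for a real dense embedding model, and `stub_llm_score` is a hypothetical placeholder for the relevance judgment that an instruction-tuned LLM would supply. Hit@k and reciprocal rank are included as two examples of standard IR metrics; the abstract does not say which metrics the paper reports.

```python
import math
import re
from collections import Counter

# Toy corpus: invented workflow descriptions, not real Galaxy entries.
WORKFLOWS = {
    "wf_rnaseq": "RNA-seq differential expression analysis with read alignment and counting",
    "wf_variant": "Variant calling from whole genome sequencing reads",
    "wf_metagenomics": "Taxonomic profiling of metagenomic samples",
    "wf_chipseq": "ChIP-seq peak calling and annotation",
}


def embed(text):
    """Stand-in for a dense embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k):
    """Stage 1: return the ids of the k workflows most similar to the query."""
    q = embed(query)
    ranked = sorted(
        WORKFLOWS,
        key=lambda wid: cosine(q, embed(WORKFLOWS[wid])),
        reverse=True,
    )
    return ranked[:k]


def stub_llm_score(query, description):
    """Hypothetical placeholder for an LLM relevance score.

    Returns the fraction of query tokens found in the description. A real
    system would prompt an LLM with the query and the workflow text instead.
    """
    q_tokens = set(embed(query))
    d_tokens = set(embed(description))
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query, candidates, score_fn=stub_llm_score):
    """Stage 2: reorder the stage-1 candidates by the reranker's score."""
    return sorted(
        candidates,
        key=lambda wid: score_fn(query, WORKFLOWS[wid]),
        reverse=True,
    )


def search(query, k_retrieve=3, k_final=2):
    """Two-stage search: retrieve a candidate pool, then rerank and truncate."""
    return rerank(query, retrieve(query, k_retrieve))[:k_final]


def hit_at_k(ranked, relevant_id, k):
    """1.0 if the relevant workflow appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked[:k] else 0.0


def reciprocal_rank(ranked, relevant_id):
    """1 / rank of the relevant workflow, or 0.0 if it was not returned."""
    for position, wid in enumerate(ranked, start=1):
        if wid == relevant_id:
            return 1.0 / position
    return 0.0


if __name__ == "__main__":
    results = search("RNA-seq differential expression")
    print(results)
    print(hit_at_k(results, "wf_rnaseq", k=1), reciprocal_rank(results, "wf_rnaseq"))
```

The split reflects the usual cost trade-off in such pipelines: the first stage scores every workflow cheaply, while the more expensive reranker only sees the short candidate list. In this sketch, replacing `embed` with a real embedding model or passing a different `score_fn` leaves the rest of the pipeline unchanged.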
