ArcBERT: An LLM-based Search Engine for Exploring Integrated Multi-Omics Metadata

Gajendra Doniparthi, Shashank Balu Pandhare, Stefan Deßloch, Timo Mühlhaus

TL;DR

ArcBERT introduces a domain-specific semantic search engine for integrated multi-omics metadata within the PLANTdataHUB ecosystem. It combines Sentence-BERT embeddings with FAISS indexing to perform structure-aware retrieval over Annotated Research Contexts, using a hybrid semantic-lexical scoring approach and post-retrieval natural-language summaries. Pre-training on plant-science literature and fine-tuning with ARC metadata enable queries in natural language to retrieve semantically relevant results across hierarchical metadata layers. Experimental comparison with Elasticsearch shows ArcBERT improves semantic matching and contextual understanding, though keyword-based retrieval remains strong in certain categories. Overall, the work advances FAIR data exploration by enabling natural-language, structure-aware search across omics metadata and registry-integrated dataHUBs.
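To make the hybrid semantic-lexical scoring concrete, here is a minimal sketch of how such a score could be combined and ranked. It is not the ArcBERT implementation: the weighting `alpha`, the term-overlap lexical score, and the helper names are assumptions made for illustration. The default `toy_semantic_sim` is a bag-of-words cosine that merely stands in for Sentence-BERT embedding similarity, and the brute-force loop stands in for a FAISS index lookup.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_semantic_sim(query, doc):
    """Assumed stand-in for embedding similarity: cosine over token counts."""
    a, b = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def lexical_overlap(query, doc):
    """Fraction of distinct query terms that also occur in the document."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q) if q else 0.0


def hybrid_search(query, docs, semantic_sim=toy_semantic_sim, alpha=0.7, k=3):
    """Rank docs by alpha * semantic + (1 - alpha) * lexical (hypothetical weighting)."""
    scored = [
        (alpha * semantic_sim(query, doc) + (1 - alpha) * lexical_overlap(query, doc), doc)
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up metadata snippets, for illustration only.
docs = [
    "Arabidopsis thaliana drought stress transcriptomics",
    "Proteomics of Chlamydomonas under heat stress",
    "Metabolomics of tomato fruit ripening",
]
results = hybrid_search("drought stress in Arabidopsis", docs, k=2)
```

Passing a different `semantic_sim` callable (for example, one backed by a real sentence encoder) changes the semantic half of the score without touching the ranking logic.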

Abstract

Traditional search applications within Research Data Management (RDM) ecosystems are crucial in helping users discover and explore structured metadata from research datasets. Typically, text search engines require users to submit keyword-based queries rather than natural-language ones. However, Large Language Models (LLMs) trained on domain-specific content are increasingly used for specialized natural language processing (NLP) tasks. We present ArcBERT, an LLM-based system designed for integrated metadata exploration. Unlike traditional search applications, ArcBERT understands natural language queries and relies on semantic matching. Notably, ArcBERT also understands the structure and hierarchies within the metadata, enabling it to handle diverse user querying patterns effectively.
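As an illustration of what structure-aware handling of hierarchical metadata can look like, the sketch below serializes a nested record into path-prefixed passages, so that each text chunk handed to a sentence encoder carries its position in the hierarchy. This is one common approach and only an assumption here; the abstract does not state how ArcBERT encodes structure. The record layout and field names (`investigation`, `studies`, `assays`) are hypothetical.

```python
def flatten(node, path=()):
    """Yield (path, text) pairs for every leaf value in a nested dict/list."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, path + (str(key),))
    elif isinstance(node, list):
        # List items share the path of the list itself.
        for item in node:
            yield from flatten(item, path)
    else:
        yield path, str(node)


def to_passages(record):
    """Turn a nested metadata record into 'a > b > c: value' strings."""
    return [f"{' > '.join(path)}: {text}" for path, text in flatten(record)]


# Hypothetical record; field names are illustrative, not the ARC schema.
arc = {
    "investigation": {
        "title": "Heat stress response",
        "studies": [
            {"title": "Leaf sampling", "assays": [{"measurement": "proteomics"}]}
        ],
    }
}
passages = to_passages(arc)
```

Each resulting passage can then be embedded and indexed on its own, which lets a query match at a specific level of the hierarchy rather than only against the record as a whole.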

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: ARC folder specification packaging standard format metadata with workflows, scripts for computational pipelines, and result files/artifacts from workflow executions.
  • Figure 2: ARC Metadata Registry integrated with multiple on-premise and remote DataHUBs hosting ARCs.
  • Figure 3: ArcBERT architecture showcasing the Sentence-BERT model, the indexing layer and the query processing layer.
  • Figure 4: Average top-5 similarity scores across query categories for ArcBERT and ElasticSearch.
  • Figure 5: Mean similarity score of the top-1 result per query category.
  • ...and 1 more figure