ArcBERT: An LLM-based Search Engine for Exploring Integrated Multi-Omics Metadata
Gajendra Doniparthi, Shashank Balu Pandhare, Stefan Deßloch, Timo Mühlhaus
TL;DR
ArcBERT introduces a domain-specific semantic search engine for integrated multi-omics metadata within the PLANTdataHUB ecosystem. It combines Sentence-BERT embeddings with FAISS indexing to perform structure-aware retrieval over Annotated Research Contexts, using a hybrid semantic-lexical scoring approach and post-retrieval natural-language summaries. Pre-training on plant-science literature and fine-tuning with ARC metadata enable queries in natural language to retrieve semantically relevant results across hierarchical metadata layers. Experimental comparison with Elasticsearch shows ArcBERT improves semantic matching and contextual understanding, though keyword-based retrieval remains strong in certain categories. Overall, the work advances FAIR data exploration by enabling natural-language, structure-aware search across omics metadata and registry-integrated dataHUBs.
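The hybrid semantic-lexical scoring mentioned above can be illustrated with a minimal sketch. This summary does not give the scoring formula, so the weighted sum, the `alpha` parameter, the token-overlap lexical score, and the names `hybrid_search`, `cosine`, and `lexical_overlap` are all assumptions made for illustration. Hand-written toy vectors and a brute-force cosine scan stand in for Sentence-BERT embeddings and a FAISS index, so the example runs without external libraries.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_overlap(query, text):
    """Fraction of distinct query tokens that also occur in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


def hybrid_search(query, query_vec, docs, alpha=0.7, k=3):
    """Rank docs by alpha * semantic + (1 - alpha) * lexical (assumed form)."""
    scored = []
    for doc in docs:
        semantic = cosine(query_vec, doc["vec"])
        lexical = lexical_overlap(query, doc["text"])
        scored.append((alpha * semantic + (1 - alpha) * lexical, doc["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical metadata entries; the vectors are toy values, not real embeddings.
docs = [
    {"id": "assay-1", "text": "drought stress response in barley roots",
     "vec": [0.9, 0.1, 0.0]},
    {"id": "study-2", "text": "water deficit tolerance of cereal crops",
     "vec": [0.8, 0.2, 0.1]},
    {"id": "assay-3", "text": "mass spectrometry instrument settings",
     "vec": [0.0, 0.1, 0.9]},
]

query = "drought stress in wheat"
query_vec = [0.85, 0.15, 0.05]  # toy stand-in for an encoded query

results = hybrid_search(query, query_vec, docs)
for score, doc_id in results:
    print(f"{doc_id}: {score:.3f}")
```

In this toy setup, `study-2` shares no tokens with the query yet still ranks above the unrelated entry because its vector is close to the query vector, which is the behaviour a purely keyword-based engine would miss.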
Abstract
Traditional search applications within Research Data Management (RDM) ecosystems are crucial in helping users discover and explore the structured metadata of research datasets. Typically, text search engines require users to submit keyword-based queries rather than natural-language ones. However, using Large Language Models (LLMs) trained on domain-specific content for specialized natural language processing (NLP) tasks is becoming increasingly common. We present ArcBERT, an LLM-based system designed for integrated metadata exploration. Unlike traditional search applications, ArcBERT understands natural language queries and relies on semantic matching. Notably, ArcBERT also understands the structure and hierarchies within the metadata, enabling it to handle diverse user querying patterns effectively.
