Domain-specific ChatBots for Science using Embeddings
Kevin G. Yager
TL;DR
This paper demonstrates how to adapt large language models (LLMs) to domain science by augmenting prompts with context retrieved from a document store via text embeddings. It also uses image embeddings to search across publication figures, supporting retrieval-augmented question answering and data interpretation. The study compares unaided and context-enhanced responses across models and sampling temperatures, finding improved grounding but a persistent risk of hallucination. It further shows that LLMs can assist with literature comprehension, experimental-detail lookups, and image-based data retrieval, offering a practical path toward developer-built domain tooling for the physical sciences.
Abstract
Large language models (LLMs) have emerged as powerful machine-learning systems capable of handling a myriad of tasks. Tuned versions of these systems have been turned into chatbots that can respond to user queries on a vast diversity of topics, providing informative and creative replies. However, their application to physical science research remains limited owing to their incomplete knowledge in these areas, contrasted with the needs of rigor and sourcing in science domains. Here, we demonstrate how existing methods and software tools can be easily combined to yield a domain-specific chatbot. The system ingests scientific documents in existing formats, and uses text embedding lookup to provide the LLM with domain-specific contextual information when composing its reply. We similarly demonstrate that existing image embedding methods can be used for search and retrieval across publication figures. These results confirm that LLMs are already suitable for use by physical scientists in accelerating their research efforts.
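The text embedding lookup described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the `embed` function is a toy bag-of-words stand-in for a learned embedding model, the helper names (`retrieve`, `build_prompt`) are hypothetical, and the document chunks are invented placeholder sentences. The sketch only shows the general pattern of ranking stored text chunks by similarity to the query and injecting the best matches into the prompt sent to the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=2):
    """Compose an LLM prompt with the retrieved chunks prepended as context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Use the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Invented placeholder chunks (assumption: not taken from any real document).
chunks = [
    "Block copolymer thin films were annealed at 180 C for 2 hours.",
    "X-ray scattering measurements were performed at a synchrotron beamline.",
    "The authors thank the funding agency for support.",
]
query = "What temperature were the block copolymer films annealed at?"
prompt = build_prompt(query, chunks, k=1)
print(prompt)
```

In a practical system, `embed` would be replaced by a call to an actual text-embedding model, and the chunk vectors would be precomputed and stored rather than recomputed for each query. The same ranking pattern applies to figure search if the vectors come from an image embedding model instead.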
