Domain-specific ChatBots for Science using Embeddings

Kevin G. Yager

TL;DR

This paper demonstrates how to adapt large language models (LLMs) to a scientific domain by augmenting prompts with context retrieved from a document store through text embeddings. Image embeddings additionally support search over publication figures, so the system covers retrieval-augmented question answering as well as data retrieval and interpretation. The study compares unaided and context-enhanced responses across models and temperatures, and finds improved grounding alongside a persistent risk of hallucination. It further shows that LLMs can assist with literature comprehension, experimental-detail lookups, and image-based data retrieval, offering a practical path toward developer-built domain tooling for the physical sciences.

Abstract

Large language models (LLMs) have emerged as powerful machine-learning systems capable of handling a myriad of tasks. Tuned versions of these systems have been turned into chatbots that can respond to user queries on a vast diversity of topics, providing informative and creative replies. However, their application to physical science research remains limited owing to their incomplete knowledge in these areas, contrasted with the needs of rigor and sourcing in science domains. Here, we demonstrate how existing methods and software tools can be easily combined to yield a domain-specific chatbot. The system ingests scientific documents in existing formats, and uses text embedding lookup to provide the LLM with domain-specific contextual information when composing its reply. We similarly demonstrate that existing image embedding methods can be used for search and retrieval across publication figures. These results confirm that LLMs are already suitable for use by physical scientists in accelerating their research efforts.
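
The text-embedding lookup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the chunk texts are placeholders, and the prompt template is hypothetical rather than the one used in the paper.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model (assumption): word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    """Prepend the retrieved chunks to the question (hypothetical template)."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"

# Placeholder document chunks, invented for this example.
chunks = [
    "GISAXS measures nanoscale structure of thin films at grazing incidence.",
    "Block copolymers self-assemble into periodic nanoscale morphologies.",
    "The cafeteria is open from noon until two.",
]
prompt = build_prompt("What does GISAXS measure in thin films?", chunks, k=1)
print(prompt)
```

In a real system the count vectors would be replaced by dense vectors from an embedding model and the linear scan by a vector index, but the flow is the same: embed the query, rank stored chunks by similarity, and place the best matches in the prompt.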

Paper Structure (18 sections, 4 figures, 1 table)

This paper contains 18 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: The LLM (OpenAI GPT 3.5) is used to rank documents by "potential for scientific impact", using pairwise comparisons where the LLM judges the impact of two scientific documents. The LLM has access to the article text (title, abstract, main text) but no ancillary information such as the name of the journal the paper was published in. The pairwise comparisons are performed on a random set of connections. We ensure that every publication has undergone at least one comparison, but do not compute a dense set of all possible comparisons (818 comparisons, out of a total possible $176^2=30,976$). Using the pairwise comparisons, we then sort the articles into a ranking from lowest impact to highest impact. The sorting is performed by starting with a random order, and then iteratively considering pairs of articles (we iterate both through the current list order, and through the list of comparisons) and accepting a swap if it decreases the total number of misordered pairs. This procedure gradually decreases the fraction of elements that are misordered relative to each other. This fraction does not decrease to zero because there is no guarantee that the pairwise evaluations form a perfectly consistent ordering (viewed as a directed graph, there are cycles in the graph). This sorting yields an ordering where only $8.1\%$ of comparisons are misordered. The graph compares overall win ratio (percentage of time a given document was deemed "higher impact" in pairwise comparisons) and uses connecting lines to show the direction of comparison (red lines denote misordered comparisons that could not be satisfied).
  • Figure 2: The LLM ranking of publications (by potential for impact) is compared against the impact factor of the journal the manuscript was published in. There is, broadly speaking, agreement between the ordering of publications by LLM assessment and the impact factor. For instance, the highest impact journal articles are indeed ranked among the highest by the LLM. Of course, perfect agreement is not expected, since impact assessment is inherently imprecise and subjective; moreover, journal impact factor is known to be a coarse proxy for scientific impact. The coefficient of determination for a linear fit to the data ($R^{2}\approx0.15$) suggests some measure of positive correlation between these metrics. Note that for the given dataset even perfect sorting by impact factor would not yield perfect correlation (but rather $R^{2}\approx0.69$), since ranking is a contiguous integer list while impact factor is a continuous variable with a non-uniform distribution.
  • Figure 3: Examples of image retrieval from a database of $50,923$ images. Input images are small-angle x-ray scattering (SAXS) detector images, including grazing-incidence (GISAXS) data, collected at the Complex Materials Scattering (CMS, 11-BM) beamline at the National Synchrotron Light Source II (NSLS-II). Examples are provided for Euclidean distance (which measures how close in meaning the images are), cosine similarity (which measures how similar in theme or topic the images are), and dot product (which measures overlap in the underlying concepts). Retrieved images show meaningful similarity.
  • Figure 4: Examples of image retrieval for SAXS/GISAXS inputs, where images from the same beamline experiment as the input were excluded. This demonstrates the ability to discover similar (conceptually related) data in different experiments (or even different materials).
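
The swap-based sorting described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the `(loser, winner)` tuple convention, the fixed seed, and the toy document labels are all hypothetical.

```python
import random
from itertools import combinations

def count_misordered(order, comparisons):
    """Count (loser, winner) comparisons where the winner is ranked below the loser."""
    position = {doc: i for i, doc in enumerate(order)}
    return sum(1 for loser, winner in comparisons if position[winner] < position[loser])

def rank_by_swaps(docs, comparisons, max_sweeps=100, seed=0):
    """Return (order, misordered); order runs from lowest to highest impact."""
    rng = random.Random(seed)
    order = list(docs)
    rng.shuffle(order)  # start from a random order
    best = count_misordered(order, comparisons)
    for _ in range(max_sweeps):
        improved = False
        # Candidate pairs: neighbours in the current list, then the compared pairs.
        candidates = list(zip(order, order[1:])) + list(comparisons)
        for a, b in candidates:
            i, j = order.index(a), order.index(b)
            order[i], order[j] = order[j], order[i]  # tentative swap
            score = count_misordered(order, comparisons)
            if score < best:
                best, improved = score, True  # keep the swap
            else:
                order[i], order[j] = order[j], order[i]  # revert the swap
        if not improved:
            break  # no swap helps any more
    return order, best

# Toy labels (assumption): a dense, fully consistent set of comparisons.
docs = ["A", "B", "C", "D", "E"]
consistent = list(combinations(docs, 2))  # each tuple is (loser, winner)
order, misordered = rank_by_swaps(docs, consistent)
print(order, misordered)

# A cycle (A < B < C < A) cannot be fully satisfied by any ordering.
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
cyc_order, cyc_misordered = rank_by_swaps(["A", "B", "C"], cyclic)
print(cyc_order, cyc_misordered)
```

With the dense, consistent comparisons the misordered count reaches zero, while the cyclic set always leaves one comparison unsatisfied, which mirrors the caption's remark that cycles prevent the misordered fraction from reaching zero. With sparse comparisons, as in the paper, a greedy search of this kind can also stop at a local minimum.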
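
The three comparison metrics named in the Figure 3 caption have simple mathematical definitions, sketched below on toy vectors. The two-dimensional vectors and image names are invented stand-ins for real image embeddings, and `rank_images` is a hypothetical helper, not part of the paper's code.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Unnormalized overlap; larger means more similar, and vector length matters."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the length-normalized vectors; depends on direction only."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def rank_images(query, database, metric):
    """Return image names ordered from best to worst match under the chosen metric."""
    if metric == "euclidean":
        return sorted(database, key=lambda name: euclidean_distance(query, database[name]))
    if metric == "cosine":
        return sorted(database, key=lambda name: cosine_similarity(query, database[name]), reverse=True)
    if metric == "dot":
        return sorted(database, key=lambda name: dot_product(query, database[name]), reverse=True)
    raise ValueError(f"unknown metric: {metric}")

# Toy embeddings (assumption): real image embeddings have many more dimensions.
database = {
    "img_a": [2.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [3.0, 3.0],
    "img_d": [0.0, 1.0],
}
query = [1.0, 0.0]

for metric in ("euclidean", "cosine", "dot"):
    print(metric, rank_images(query, database, metric))
```

On these toy vectors each metric picks a different best match: Euclidean distance favors the nearest point, cosine similarity favors the vector pointing in the same direction regardless of length, and the dot product favors long vectors with a large component along the query.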