AiSciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification

Brendan Hogan, Anmol Kabra, Felipe Siqueira Pacheco, Laura Greenstreet, Joshua Fan, Aaron Ferber, Marta Ummus, Alecsander Brito, Olivia Graham, Lillian Aoki, Drew Harvell, Alex Flecker, Carla Gomes

TL;DR

AiSciVision is a framework that specializes Large Multimodal Models (LMMs) into interactive research partners and classification models for image classification tasks in niche scientific domains. It outperforms fully supervised models in both low-labeled and full-labeled data settings.

Abstract

Trust and interpretability are crucial for the use of Artificial Intelligence (AI) in scientific research, but current models often operate as black boxes, offering limited transparency and little justification for their outputs. We introduce AiSciVision, a framework that specializes Large Multimodal Models (LMMs) into interactive research partners and classification models for image classification tasks in niche scientific domains. Our framework uses two key components: (1) Visual Retrieval-Augmented Generation (VisRAG) and (2) domain-specific tools used in an agentic workflow. To classify a target image, AiSciVision first retrieves the most similar positive and negative labeled images as context for the LMM. The LMM agent then actively selects and applies tools to manipulate and inspect the target image over multiple rounds, refining its analysis before making a final prediction. The VisRAG and tooling components are designed to mirror the processes of domain experts, who often compare new data to similar examples and use specialized tools to manipulate and inspect images before arriving at a conclusion. Each inference produces both a prediction and a natural language transcript detailing the reasoning and tool usage that led to the prediction. We evaluate AiSciVision on three real-world scientific image classification datasets: detecting the presence of aquaculture ponds, diseased eelgrass, and solar panels. Across these datasets, our method outperforms fully supervised models in both low-labeled and full-labeled data settings. AiSciVision is actively deployed in real-world use, specifically for aquaculture research, through a dedicated web application that displays the transcripts and allows expert users to converse with them. This work represents a crucial step toward AI systems that are both interpretable and effective, advancing their use in scientific research and scientific discovery.
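
The retrieval step described in the abstract (fetching the most similar positive and negative labeled images for a target image) can be illustrated with a minimal sketch. It assumes each image has already been mapped to an embedding vector by some encoder and that similarity is measured with cosine similarity; the abstract specifies neither, so both are assumptions, and the names `cosine` and `retrieve_pos_neg` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_pos_neg(target, embeddings, labels):
    """Return (index of most similar positive, index of most similar negative).

    Assumption: labels are 1 (positive) or 0 (negative), and both classes
    occur at least once in the labeled set; otherwise a KeyError is raised.
    """
    best = {}  # label -> (index, similarity) of the closest example so far
    for idx, (emb, label) in enumerate(zip(embeddings, labels)):
        sim = cosine(target, emb)
        if label not in best or sim > best[label][1]:
            best[label] = (idx, sim)
    return best[1][0], best[0][0]
```

In a pipeline like the one described, the two retrieved images would be passed to the LMM alongside the target image as comparison context before any tool use begins.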

Paper Structure

This paper contains 24 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: A schematic of our AiSciVision framework, comprising two components: (1) Visual Retrieval-Augmented Generation (VisRAG) and (2) domain-specific tools. Given an input test image, we use the VisRAG component to retrieve the most similar positive and negative examples from the training dataset. We then prompt the LMM to compare these images with the test image and to use domain-specific interactive tools over several rounds of a conversation in which the LMM acts as an AI agent. Finally, the agent outputs a prediction and the inference transcript. The transcript offers insight into the agent's reasoning process, improving interpretability, transparency, and trust, all crucial for applications in scientific domains.
  • Figure 2: Four example images from each dataset: Aquaculture, Eelgrass, and Solar (left to right). The top row shows positive examples (aquaculture ponds present, eelgrass plant diseased, and solar panels present), and the bottom row shows negative examples.
  • Figure 3: We analyze the frequency of tool use in AiSciVision for each dataset and its effect on the final classification result. Blue bars indicate the number of tool calls during inference on our test sets of 100 images, whereas green bars indicate the mean classification accuracy over conversations in which the specified tool was called in any round. Note that the LMM can call a tool multiple times in the same conversation. Across all datasets, the LMM almost always requests the supervised model tool "MLToolPredict" but does not rely solely on the returned results. The LMM relies heavily on geospatial tools for the Aquaculture dataset, since these tools return additional information from the vicinity. Curiously, for the other two datasets the LMM does not use the AdjustBrightness tool at all and uses HistogramEqualization sparingly. All tools are described in the appendix on dataset-specific tools.
  • Figure 4: Screenshot of the AiSciVision web application for aquaculture pond detection. The application allows users to upload satellite images, interact with visual tools such as zoom and contrast adjustment, and view predictions alongside retrieved similar examples. Additionally, the produced transcript is conversable in a ChatGPT-like manner, allowing the user to ask questions about the inference steps or even to give rich natural language feedback about errors the model might have made. For future work, we aim to incorporate this feedback into the VisRAG, enabling an ever-learning RAG system that continually improves our model.
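
The two quantities plotted in Figure 3 (total tool calls, and mean accuracy over conversations in which a tool was called in any round) can be computed roughly as follows. This is a sketch under assumptions: the representation of each conversation as a pair of (list of tool names called, whether the final prediction was correct) is hypothetical, as is the function name `tool_usage_stats`. Only the tool names come from the caption.

```python
from collections import Counter, defaultdict


def tool_usage_stats(conversations):
    """Compute per-tool call counts and per-tool mean accuracy.

    conversations: iterable of (tool_calls, correct) pairs, where tool_calls
    is a list of tool names (repeats allowed) and correct is a bool.
    This input layout is an assumption made for illustration.
    """
    call_counts = Counter()
    outcomes = defaultdict(list)
    for tool_calls, correct in conversations:
        # Every call counts, including repeated calls within one conversation.
        call_counts.update(tool_calls)
        # For accuracy, each conversation contributes once per distinct tool.
        for tool in set(tool_calls):
            outcomes[tool].append(1.0 if correct else 0.0)
    accuracy = {tool: sum(vals) / len(vals) for tool, vals in outcomes.items()}
    return dict(call_counts), accuracy
```

A tool that is never called, as the caption reports for AdjustBrightness on two of the datasets, simply does not appear in either returned dictionary.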