SlicerChat: Building a Local Chatbot for 3D Slicer

Colton Barr

TL;DR

3D Slicer documentation is spread across diverse sources and is poorly covered by general-purpose LLMs, so external services often hallucinate when asked about it. The author builds SlicerChat, a local chatbot integrated into 3D Slicer that uses open-source CodeLlama-based models with LoRA fine-tuning and a Retrieval Augmented Generation (RAG) pipeline over FAISS vector stores, augmented with data from the current MRML scene. On a five-question benchmark, fine-tuning had no impact on answer quality or speed, whereas larger models answered better but were significantly slower. The choice of RAG data sources materially affected answer quality: Python sample code and Markdown documentation were the most useful, while scene data and Discourse questions provided additional gains. The work demonstrates the feasibility of a privacy-preserving chatbot, delivered as a 3D Slicer extension, that can help both novices and developers use 3D Slicer more efficiently, and it highlights the importance of data sourcing and system architecture in local AI deployments.

Abstract

3D Slicer is a powerful platform for 3D data visualization and analysis, but it has a significant learning curve for new users. Generative AI applications, such as ChatGPT, have emerged as a potential way to bridge the gap between various sources of documentation using natural language. Because LLM services have had limited exposure to 3D Slicer documentation, however, ChatGPT and related services tend to hallucinate significantly. The objective of this project is to build a chatbot architecture, called SlicerChat, that is optimized to answer 3D Slicer-related questions and can run locally using an open-source model. The core research questions concern how answer quality and speed vary with fine-tuning, model size, and the type of domain knowledge included in the prompt. A prototype SlicerChat system, based on the Code-Llama Instruct architecture, was built as a custom extension in 3D Slicer. Models of size 1.1B, 7B and 13B were fine-tuned using Low-Rank Adaptation (LoRA), and various sources of 3D Slicer documentation were compiled for use in a Retrieval Augmented Generation (RAG) paradigm. Testing combinations of fine-tuning and model size on a benchmark dataset of five 3D Slicer questions revealed that fine-tuning had no impact on model performance or speed compared to the base architecture, and that larger models performed better but were significantly slower. Experiments with adding 3D Slicer documentation to the prompt showed that Python sample code and Markdown documentation were the most useful information to include, and that adding 3D Slicer scene data and questions taken from Discourse further improved model performance. In conclusion, this project shows the potential for integrating a high-quality, local chatbot directly into 3D Slicer to help new users and experienced developers alike use the software more efficiently.
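As a rough illustration of the Retrieval Augmented Generation step described in the abstract, the sketch below retrieves the best-matching chunks from a selectable set of knowledge sources and assembles them, together with a scene summary, into a prompt. It is not the SlicerChat implementation: the bag-of-words `embed` function and brute-force cosine search stand in for a learned embedding model and a FAISS index, and the `KNOWLEDGE` contents, source keys, function names, and prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder chunks, keyed by knowledge source.
# A real system would index actual documentation in a vector store.
KNOWLEDGE = {
    "python": ["Sample script: load a volume from disk and print its name."],
    "markdown": ["The Volumes module adjusts window and level for a loaded volume."],
    "discourse": ["Forum answer: segmentation export fails when the master volume is missing."],
}


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, sources, k=2):
    """Return up to k (source, chunk) pairs from the enabled sources, best first."""
    q = embed(query)
    scored = [
        (cosine(q, embed(chunk)), source, chunk)
        for source in sources
        for chunk in KNOWLEDGE.get(source, [])
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, chunk) for score, source, chunk in scored[:k] if score > 0]


def build_prompt(query, sources, scene_summary=None, k=2):
    """Assemble scene data, retrieved context, and the question into one prompt."""
    parts = []
    if scene_summary:
        parts.append("Current scene:\n" + scene_summary)
    for source, chunk in retrieve(query, sources, k):
        parts.append(f"[{source}] {chunk}")
    parts.append("Question: " + query)
    return "\n\n".join(parts)
```

Calling `build_prompt("How do I load a volume?", ["python", "markdown"], scene_summary="1 volume node")` would leave Discourse content out of the prompt, which mirrors how the experiments compare answers across different combinations of knowledge sources.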

Paper Structure (18 sections, 7 figures)

Introduction
Research Questions
Methods
Data
Model Selection and Finetuning
Retrieval Augmented Generation
3D Slicer Extension and Architecture
Experiments
Benchmark Dataset
RQ1: Comparison of Model Size and Fine-tuning
RQ2: Comparison of RAG Knowledge Sources
Results
Model Fine-Tuning
Comparison of Model Fine-tuning and Architecture
Comparison of RAG Knowledge Sources
...and 3 more sections

Figures (7)

Figure 1: The complete pipeline of SlicerChat. The user starts by entering a prompt, and the extension extracts the current 3D Slicer scene data. This data is then passed along with the query to the separate Python process, where it is accessed as one of five sources of RAG data. The resulting RAG prompt is passed to the selected model, and the output from the model is streamed back to 3D Slicer.
Figure 2: The 3D Slicer extension UI for SlicerChat. It includes buttons for starting and connecting to the external LLM process, configuring the base LLM and the RAG knowledge to include in the prompt, resetting the conversation, and submitting the input prompt. The larger dark square contains the streamed output tokens, while the smaller rectangle at the bottom is where the user enters their prompt.
Figure 3: The model performance on each of the 5 benchmark questions, grouped by the RAG knowledge made available at inference time.
Figure 4: The fine-tuning outputs for all three model architecture sizes. The first row contains the results of the Bayesian Hyperparameter sweeps using Weights and Biases, while the second row shows the loss curve on the evaluation set for each network tested, and the final row indicates the lowest evaluation loss value obtained for all tested networks.
Figure 5: The inference time in seconds for each question, grouped by the model that generated the answer. For each model size (1.1B, 7B and 13B), both the base model and the fine-tuned model are tested.
...and 2 more figures
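The request and streaming flow described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the extension's actual code: a worker thread with two queues stands in for the separate Python process and its transport, `stub_model` is a placeholder for the LLM, and the JSON message fields (`query`, `scene`) and all function names are hypothetical.

```python
import json
import queue
import threading

END_OF_STREAM = None  # sentinel the worker sends after the last token


def stub_model(prompt):
    """Placeholder for the LLM: yields tokens one at a time, like a streamed answer."""
    for token in f"Echo: {prompt.splitlines()[-1]}".split():
        yield token


def llm_worker(requests, responses):
    """Stands in for the external LLM process: read a request, stream tokens back."""
    while True:
        raw = requests.get()
        if raw is None:  # shutdown signal
            break
        message = json.loads(raw)
        # Scene data travels with the query and becomes part of the prompt.
        prompt = "Scene: " + message["scene"] + "\n" + message["query"]
        for token in stub_model(prompt):
            responses.put(token)
        responses.put(END_OF_STREAM)


def ask(query, scene, requests, responses):
    """Extension side: send query plus scene data, then collect streamed tokens."""
    requests.put(json.dumps({"query": query, "scene": scene}))
    tokens = []
    while (token := responses.get()) is not END_OF_STREAM:
        tokens.append(token)  # a UI would append each token to the output box here
    return " ".join(tokens)


def run_once(query, scene):
    """Start the worker, ask one question, and shut the worker down again."""
    requests, responses = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=llm_worker, args=(requests, responses), daemon=True)
    worker.start()
    try:
        return ask(query, scene, requests, responses)
    finally:
        requests.put(None)
        worker.join()
```

Keeping the model behind a message boundary like this means the UI only ever handles serialized requests and a stream of tokens, so the backing model could be swapped without touching the extension side.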
