ConnectomeBench: Can LLMs Proofread the Connectome?

Jeff Brown, Andrew Kirjner, Annika Vivekananthan, Ed Boyden

TL;DR

ConnectomeBench introduces a multimodal benchmarking suite to assess LLMs on three core connectome proofreading tasks: segment identification, split error correction, and merge error identification, using MICrONS and FlyWire datasets. The study demonstrates that current LLMs can surpass null baselines and achieve strong performance in segment-type classification and split-error tasks, with significant gains from prompt design and reasoning heuristics, though merge-error detection remains comparatively challenging. By evaluating proprietary and open-weight models across binary and multiple-choice formats, the work establishes a baseline of capability and identifies concrete avenues—such as heuristic-driven prompting and agent-style workflows—for advancing toward AI-assisted proofreading. The ConnectomeBench framework thus provides a rigorous, standardized way to track progress toward automated or semi-automated proofreading in connectomics, with potential to dramatically reduce human proofreading effort in future large-scale brain connectomes.

Abstract

Connectomics - the mapping of neural connections in an organism's brain - currently requires extraordinary human effort to proofread the data collected from imaging and machine-learning-assisted segmentation. With the growing excitement around using AI agents to automate important scientific tasks, we explore whether current AI systems can perform multiple tasks necessary for data proofreading. We introduce ConnectomeBench, a multimodal benchmark evaluating large language model (LLM) capabilities in three critical proofreading tasks: segment type identification, split error correction, and merge error detection. Using expert-annotated data from two large open-source datasets - a cubic millimeter of mouse visual cortex and the complete Drosophila brain - we evaluate proprietary multimodal LLMs including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, and GPT-4o, as well as open-source models like InternVL-3 and NVLM. Our results demonstrate that current models achieve surprisingly high performance in segment identification (52-82% balanced accuracy vs. 20-25% chance) and binary/multiple-choice split error correction (75-85% accuracy vs. 50% chance) while generally struggling on merge error identification tasks. Overall, while the best models still lag behind expert performance, they demonstrate promising capabilities that could eventually enable them to augment and potentially replace human proofreading in connectomics. Project page: https://github.com/jffbrwn2/ConnectomeBench. Dataset: https://huggingface.co/datasets/jeffbbrown2/ConnectomeBench/tree/main
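
The segment identification results above are stated as balanced accuracy against a 20-25% chance level. As a minimal sketch, assuming the standard definition of balanced accuracy (the mean of per-class recall) and using hypothetical labels `a` to `d` rather than the benchmark's actual classes, the snippet below shows why a predictor that always outputs one class scores 1/K on a K-class task regardless of class imbalance. The function name and the data are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    total = Counter(y_true)  # number of examples per true class
    # Count correct predictions per true class (missing keys read as 0).
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical, imbalanced 4-class label set: 7 of 10 examples are class "a".
y_true = ["a"] * 7 + ["b", "c", "d"]
# A trivial predictor that always answers "a".
always_a = ["a"] * len(y_true)

plain_accuracy = sum(t == p for t, p in zip(y_true, always_a)) / len(y_true)
print(plain_accuracy)                       # rewarded by the class imbalance
print(balanced_accuracy(y_true, always_a))  # recall is 1 for "a" and 0 elsewhere
```

Under this definition the chance level is 1/K, so 25% and 20% would correspond to four- and five-way classification respectively; the exact class counts per dataset are an assumption here.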

Paper Structure

This paper contains 31 sections, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Summary of the three tasks evaluated in ConnectomeBench. The left panel shows examples of four different types of segments: A) single neuron, B) multiple neurons merged together, C) neuronal processes with a cell body (soma), and D) isolated cell nucleus. Examples of 3D segments of non-neuronal cell types can be found in the appendix on non-neuronal cell types. The upper right panel shows a segment with a split error (in blue) and two potential merge candidates to correct the error (in orange): a correct merge candidate on the left and an incorrect one on the right. The lower right panel shows examples of segments without and with merge errors (on the left and right, respectively).
  • Figure 2: Example prompt used for classifying the segment type. Text in blue is the additional context provided in the "Description" prompts. In this case, the correct answer would be (b).
  • Figure 3: Prompt used for identifying split error corrections. Text in blue is the additional context included in the "Description" prompts.
  • Figure 4: ROC Curves for the binary split error correction and merge error identification tasks. TPR=True Positive Rate, FPR=False Positive Rate.
  • Figure 5: Prompt used for identifying merge errors. Text in blue is the additional context included in the "Description" prompts.
  • ...and 7 more figures
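
Figure 4 reports ROC curves in terms of TPR and FPR for the binary tasks. As a rough illustration of how such a curve is traced, the sketch below sweeps a decision threshold over per-example confidence scores and integrates the resulting points with the trapezoidal rule. The scores and labels are made up for the example, and the helper names `roc_points` and `auc` are hypothetical; how the paper obtains its scores from the models is not shown in this section.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs from sweeping a threshold over the scores."""
    pos = sum(labels)          # number of positive examples (label 1)
    neg = len(labels) - pos    # number of negative examples (label 0)
    points = [(0.0, 0.0)]      # threshold above every score: nothing flagged
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up confidence scores; label 1 = positive case, 0 = negative case.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

An area of 0.5 corresponds to a scorer that ranks positives and negatives no better than chance, which is the natural reference line when reading such a plot.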