ConnectomeBench: Can LLMs Proofread the Connectome?
Jeff Brown, Andrew Kirjner, Annika Vivekananthan, Ed Boyden
TL;DR
ConnectomeBench is a multimodal benchmark suite that assesses LLMs on three core connectome proofreading tasks (segment identification, split error correction, and merge error identification) using the MICrONS and FlyWire datasets. The study shows that current LLMs surpass null baselines and perform well on segment-type classification and split-error tasks, with significant gains from prompt design and reasoning heuristics, while merge-error detection remains comparatively challenging. By evaluating proprietary and open-weight models in binary and multiple-choice formats, the work establishes a capability baseline and identifies concrete avenues, such as heuristic-driven prompting and agent-style workflows, for advancing toward AI-assisted proofreading. ConnectomeBench thus provides a standardized way to track progress toward automated or semi-automated proofreading in connectomics, which could substantially reduce human proofreading effort on future large-scale brain connectomes.
Abstract
Connectomics, the mapping of neural connections in an organism's brain, currently requires extraordinary human effort to proofread the data collected from imaging and machine-learning-assisted segmentation. With the growing excitement around using AI agents to automate important scientific tasks, we explore whether current AI systems can perform multiple tasks necessary for data proofreading. We introduce ConnectomeBench, a multimodal benchmark evaluating large language model (LLM) capabilities in three critical proofreading tasks: segment type identification, split error correction, and merge error detection. Using expert-annotated data from two large open-source datasets (a cubic millimeter of mouse visual cortex and the complete Drosophila brain), we evaluate proprietary multimodal LLMs including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, and GPT-4o, as well as open-source models such as InternVL-3 and NVLM. Our results demonstrate that current models achieve surprisingly high performance in segment identification (52-82% balanced accuracy vs. 20-25% chance) and binary/multiple-choice split error correction (75-85% accuracy vs. 50% chance), while generally struggling on merge error identification tasks. Overall, while the best models still lag behind expert performance, they demonstrate promising capabilities that could eventually enable them to augment and potentially replace human proofreading in connectomics. Project page: https://github.com/jffbrwn2/ConnectomeBench. Dataset: https://huggingface.co/datasets/jeffbbrown2/ConnectomeBench/tree/main
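For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how balanced accuracy and its chance level can be computed. It is not taken from the ConnectomeBench code: the function name, the label names, and the toy predictions are all hypothetical and chosen only for illustration. Balanced accuracy is the mean of per-class recall, so a predictor that always outputs one class scores 1/K on K classes, which gives 25% for four classes and 20% for five.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical segment-type labels (illustrative only, not the benchmark's classes).
y_true = ["neuron", "neuron", "neuron", "glia", "fragment", "nucleus"]
y_pred = ["neuron", "neuron", "glia", "glia", "neuron", "nucleus"]

score = balanced_accuracy(y_true, y_pred)

# Chance level for K classes under balanced accuracy is 1/K.
chance = 1 / len(set(y_true))

# A constant predictor lands exactly on the chance level.
constant = balanced_accuracy(y_true, ["neuron"] * len(y_true))

print(f"balanced accuracy: {score:.3f}, chance: {chance:.3f}, constant: {constant:.3f}")
```

Because each class contributes equally regardless of how many examples it has, this metric is not inflated when one segment type dominates the data, which is why it is compared against a 1/K baseline rather than the majority-class frequency.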
