FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures

Lisa Mais; Peter Hirsch; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Annika Reinke; Lena Maier-Hein; Gudrun Ihrke; Dagmar Kainmueller

TL;DR

FISBe introduces a real-world 3D light microscopy benchmark for instance segmentation of long-range, thin filamentous neurons, addressing the gap that methods for capturing long-range dependencies are typically benchmarked on synthetic datasets. It provides 101 MultiColor FlpOut (MCFO) brain images with pixel-wise neuron masks and a tailored evaluation suite that combines the centerline-based average F1 score (avF1) and the centerline coverage (C) into a composite score $S = 0.5 \times \text{avF1} + 0.5 \times \text{C}$, reported alongside false split (FS) and false merge (FM) counts. The authors benchmark three baselines (PatchPerPix, FFN, and color clustering) and show that current methods struggle with long-range dependencies and overlapping neurons, highlighting the need for new approaches. By releasing the data, metrics, and baselines, the work aims to spur advances in long-range data modeling and to enable downstream neuroscience analyses, while acknowledging a bias toward sparser samples and substantial computational demands. The dataset is released under a CC BY 4.0 license, enabling community use and extension; future work includes self-supervised pretraining and novel long-range models.
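As a minimal sketch of the composite score stated above, the following Python function evaluates $S = 0.5 \times \text{avF1} + 0.5 \times \text{C}$. The function name `composite_score` and the range check are illustrative assumptions, not part of the released evaluation code.

```python
def composite_score(av_f1: float, coverage: float) -> float:
    """Combine avF1 and centerline coverage C with equal weights.

    Both inputs are assumed to lie in [0, 1]; the range check is an
    illustrative safeguard, not something the paper prescribes.
    """
    if not (0.0 <= av_f1 <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("avF1 and C are expected to lie in [0, 1]")
    return 0.5 * av_f1 + 0.5 * coverage


# A perfect prediction reaches the maximum score.
perfect = composite_score(1.0, 1.0)

# Full coverage but no instance counted as a true positive
# (avF1 = 0, C = 1) still yields a score of one half.
fragmented = composite_score(0.0, 1.0)
```

Because both terms carry equal weight, a method cannot reach a high score through coverage alone: an over-segmented prediction with full coverage but no true positives is capped at 0.5.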

Abstract

Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

Paper Structure (29 sections, 6 equations, 14 figures, 6 tables, 1 algorithm)

Introduction
Dataset
Image Data Acquisition and Characteristics
Image Selection and Labeling Process
Benchmarking Setup
Evaluation Metrics
Metrics Definitions
Discussion
Baselines
Conclusion
Acknowledgements
Appendix
Dataset Documentation
Datasheet
Motivation
...and 14 more sections

Figures (14)

Figure 1: Exemplary challenging cases for disentangling neurons in FISBe images (top row), and respective expert annotations (bottom row). (a) Long overlap of two neurons running in parallel, (b) two almost completely overlapping neurons in different color (only one could successfully be annotated), (c) two inter-weaved neurons of same color that could not be separated (clearly identified by two somata), and (d) dim neuron in noisy background.
Figure 2: Visualization of segmentation examples to assess suitable evaluation metrics: (a) Depending on the split position, avF1 can vary significantly at identical gt coverage and error count. (b) Using the avF1 score alone would favor lower coverage over more false split errors. This might be disadvantageous in downstream analysis tasks. In (c) resolving the merge leads to a large improvement in the overall score. In (d) both cases achieve a perfect score with respect to the gt coverage C. By penalizing FP and FS errors in the F1 score, the limitations of these predictions are reflected in the overall score. For more edge cases and the full quantitative numbers, please see the supplementary figure on edge cases (Figure 5).
Figure 3: Qualitative results for our three baseline methods: PatchPerPix (ppp), Flood Filling Networks (FFN) and Duan et al.'s color clustering. In the top row all three methods yield a few correctly segmented neurons (the two green neurons), but ppp and FFN merge the blue and the red one, and Duan et al.'s method splits the blue neuron while nicely segmenting the red one. In the bottom row ppp merges many neurons of different color; FFN segments three neurons, but has low coverage; and Duan et al.'s method also merges differently colored neurons.
Figure 4: Exemplary challenges for many-to-many matching with overlaps. (a) One predicted instance lies completely within an overlapping gt region, but it should only be assigned to one of them; (b) one predicted instance covers one gt and merges with an overlapping gt region, here it should be assigned to the single gt and one of the overlapping ones; and (c) three overlapping predicted instances cover two overlapping gt instances, here only two predicted instances should be matched to the two gt instances respectively (the other predicted instance should rather only count as false positive than as false split). As there are plenty of scenarios how gt and predicted instances can overlap, special treatment for overlapping regions is difficult and error-prone. However, our proposed algorithm (see Alg. 1 in the main paper) naturally handles such overlaps by keeping track of already matched pixels (as opposed to only on the level of instances).
Figure 5: Quantitative assessment of a number of different edge cases of our evaluation metrics (outline: ground truth, color: predictions, $\text{th}=0.5$), highlighting their applicability and validity for FISBe. In (a) we have a perfect prediction, the score is perfect and there are no errors. In (b) we have no prediction, the score is zero and we have as many FN as there are instances. In (c) we have one prediction that covers the whole image; as the clDice value is too low, there is no TP, so avF1 is still zero. When computing the clPrecision for the predicted instance, the corresponding skeleton will likely have the largest overlap with the ground truth background and will be matched to it. Thus, C will be zero as well. In (d) we have a perfect foreground segmentation but the two ground truth instances are merged; the predicted instance is assigned to one of the ground truth instances, resulting in $\text{C} = 0.5$. Assuming clDice $=0.67$ for the one match (and thus $\text{clDice}_{\text{TP}} = 0.67$), we have $\text{F1}=0.67$ for $\text{th}<0.7$ and $0$ otherwise. In (e) we again have a perfect foreground segmentation but there are many small instances; clDice for each pair of predicted and ground truth instances is $<0.1$, thus $\text{avF1}=0$ and $\text{clDice}_{\text{TP}}=0$; however, $\text{C} = 1.0$ because both instances are completely covered (and multiple predicted instances can be matched to one ground truth instance). 
In (f), (g) and (h) overall slightly more than half of the total ground truth is covered: in (f) both instances are covered slightly more than half; in (g) one instance is covered completely and the other is not; in (h) one instance is covered completely and only a tiny part of the other. To distinguish the cases quantitatively, one has to look at the details. About the same amount of ground truth is covered, thus C has a similar value. In (f) $\text{clDice}_{\text{TP}}$ is worst, as both predicted instances are counted as TP, yet both only cover just over half of their respective ground truth instance. Furthermore, while avF1 is identical for (f) and (g), there are more differences when looking at the full range of F1$_{\text{th}}$ values: in (f) there are 2 TP for $\text{th}<0.7$, resulting in F1 being equal to 1 for smaller thresholds and equal to 0 for larger thresholds; in (g) there is 1 TP for the full range of thresholds, but also 1 FN; in both cases this results in $\text{avF1}=0.67$. Finally, in (h) there is 1 TP for the full range of thresholds, 1 FN as in (g), but also 1 FP, resulting in $\text{avF1}=0.5$. In (i) we have a perfect prediction as in (a), but in addition we have a number of small FP, due to noise categorized as foreground; the coverage values are not affected, but the avF1 value drops. One could argue that (h) should be better than (g) as more is detected; however, if a prediction is too small, it is, in general, more likely to be noise. One could also argue that (d) should be better than (g), as both neurons are detected, just merged; however, for downstream tasks having one fully correct instance that can directly be used is often more valuable than first having to manually fix errors.
...and 9 more figures
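The avF1 values quoted for cases (f), (g) and (h) in the Figure 5 caption can be reproduced with a short calculation. The sketch below is an assumption-laden illustration, not the benchmark's evaluation code: it assumes thresholds from 0.1 to 0.9 in steps of 0.1 (a grid consistent with the numbers in the caption), assumes that a matched pair counts as TP when its clDice exceeds the threshold, and uses hypothetical clDice values (0.67, 1.0, 0.05) chosen to match the caption's description. The function names are likewise hypothetical.

```python
def f1_at_threshold(cldice_scores, n_gt, n_pred, th):
    """F1 at one threshold; a matched pair is a TP if its clDice exceeds th."""
    tp = sum(1 for s in cldice_scores if s > th)
    fp = n_pred - tp  # predicted instances without a sufficient match
    fn = n_gt - tp    # ground truth instances left undetected
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def av_f1(cldice_scores, n_gt, n_pred):
    """Average F1 over the assumed threshold grid 0.1, 0.2, ..., 0.9."""
    thresholds = [i / 10 for i in range(1, 10)]
    scores = [f1_at_threshold(cldice_scores, n_gt, n_pred, th) for th in thresholds]
    return sum(scores) / len(scores)


# (f) two predictions, each covering just over half of its instance
#     (assumed clDice 0.67): 2 TP below th = 0.7, none above.
case_f = av_f1([0.67, 0.67], n_gt=2, n_pred=2)

# (g) one instance fully covered, the other missed: 1 TP and 1 FN throughout.
case_g = av_f1([1.0], n_gt=2, n_pred=1)

# (h) as (g), plus a tiny prediction (assumed clDice 0.05) counted as FP.
case_h = av_f1([1.0, 0.05], n_gt=2, n_pred=2)

print(round(case_f, 2), round(case_g, 2), round(case_h, 2))
```

Under these assumptions, (f) and (g) both average to roughly 0.67 while (h) drops to 0.5, which illustrates why the caption recommends inspecting the per-threshold F1 values and not only the average.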
