SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images

Josh Myers-Dean, Jarek Reynolds, Brian Price, Yifei Fan, Danna Gurari

TL;DR

SPIN introduces SubPartImageNet, the first natural-image dataset with exhaustive subpart annotations across 203 subpart categories, enabling subpart granularity in hierarchical segmentation. It also proposes two evaluation metrics, Spatial Consistency Score ($SpCS$) and Semantic Consistency Score ($SeCS$), to quantify cross-level spatial containment and semantic entailment across object–part–subpart hierarchies, supplementing traditional IoU measures. Through comprehensive benchmarking of open-vocabulary localization, interactive segmentation, and zero-shot semantic recognition, the paper shows substantial gaps in subpart understanding, with notable gains when training on SPIN data and strong cross-level containment in some models. The work provides a public dataset and a framework to push progress in fine-grained hierarchical segmentation with practical implications for captioning, visual QA, AR, and accessibility.

Abstract

Hierarchical segmentation entails creating segmentations at varying levels of granularity. We introduce the first hierarchical semantic segmentation dataset with subpart annotations for natural images, which we call SPIN (SubPartImageNet). We also introduce two novel evaluation metrics to evaluate how well algorithms capture spatial and semantic relationships across hierarchical levels. We benchmark modern models across three different tasks and analyze their strengths and weaknesses across objects, parts, and subparts. To facilitate community-wide progress, we publicly release our dataset at https://joshmyersdean.github.io/spin/index.html.
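
To make the idea of cross-level spatial containment concrete, here is a minimal sketch of one way to measure how much of a predicted subpart mask falls inside its parent part mask. It is an illustration only and is not the paper's $SpCS$ definition, which this summary does not state. The function name `containment_ratio`, the `Mask` type, and the choice to treat an empty child mask as fully contained are all assumptions made for this example.

```python
from typing import Sequence

# Hypothetical mask representation: rows of per-pixel truthy/falsy values.
Mask = Sequence[Sequence[int]]


def containment_ratio(child: Mask, parent: Mask) -> float:
    """Fraction of child-mask pixels that also lie inside the parent mask.

    Illustrative only; NOT the SpCS formula from the paper.
    """
    same_rows = len(child) == len(parent)
    if not same_rows or any(len(c) != len(p) for c, p in zip(child, parent)):
        raise ValueError("masks must have the same shape")

    child_area = 0  # number of pixels set in the child mask
    inside = 0      # child pixels that are also set in the parent mask
    for child_row, parent_row in zip(child, parent):
        for c, p in zip(child_row, parent_row):
            if c:
                child_area += 1
                if p:
                    inside += 1

    # Assumption: an empty child mask counts as trivially contained.
    return 1.0 if child_area == 0 else inside / child_area


# Toy 3x4 masks: the subpart has 3 pixels, 2 of which lie inside the part.
part = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
subpart = [[0, 0, 1, 1],
           [0, 0, 1, 0],
           [0, 0, 0, 0]]
ratio = containment_ratio(subpart, part)
```

A ratio of 1.0 would mean the subpart prediction lies entirely within its parent part. Lower values indicate pixels that spill outside the parent, which is the kind of cross-level inconsistency that IoU computed at a single level cannot reveal.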

Paper Structure (53 sections, 2 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Overview of the diversity of SPIN. Panels (a) and (b) depict subparts unique to specific object class members, such as a roll cage in a car and a shell in a turtle. Panels (c) and (d) illustrate the variability in the number of subparts per object of the same class, with examples of 13 and 6 subparts. Panels (e) and (f) highlight the disparity in image area coverage by different subparts, such as a bottle label (large) versus quadruped claws (tiny).
  • Figure 2: Histogram of the number of unique subpart category labels for each of the 34 part categories. (Aero=Aeroplane; Quad=Quadruped)
  • Figure 3: Boxplots showing the distribution of subpart image occupation (left) and boundary complexities per part, per object (right). The blue lines represent medians, bottoms and tops of each box represent the 25th and 75th percentile values respectively, and whiskers represent the most extreme data points not considered outliers. Overall, SPIN's subparts take up a relatively small number of pixels per image, while featuring a range of geometric complexity. (Aero=Aeroplane; Quad=Quadruped)
  • Figure 4: Interface AMT crowdworkers used to create SPIN's ground truth annotations.
  • Figure 5: Histogram of the number of subpart-part category occurrences (in thousands) across the SPIN dataset for each of the 34 part categories. The biped head, the quadruped head, and the car body contain the most subpart occurrences within their parent part. (Aero=Aeroplane; Quad=Quadruped)
  • ...and 6 more figures