
Learning Hierarchical Image Segmentation For Recognition and By Recognition

Tsung-Wei Ke, Sangwoo Mo, Stella X. Yu

TL;DR

This paper tackles the mismatch between segmentation and recognition by treating segmentation as an internal, hierarchical grounding mechanism for recognition. It introduces CAST, a Vision Transformer variant that uses adaptive superpixel tokens and graph pooling to generate a fine-to-coarse segmentation hierarchy while training purely from image-level recognition objectives. CAST demonstrates strong unsupervised hierarchical segmentation, competitive semantic segmentation, and efficient recognition, outperforming SAM, ViT, and HSG on several benchmarks, and enabling test-time adaptation to refine both segmentation and recognition. The approach offers a practical, unified backbone for joint segmentation-recognition tasks, with potential for open-vocabulary segmentation and improved interpretability through part-to-whole parsing.

Abstract

Large vision and language models learned directly through image-text associations often lack detailed visual substantiation, whereas image segmentation tasks are treated separately from recognition, supervisedly learned without interconnections. Our key observation is that, while an image can be recognized in multiple ways, each has a consistent part-and-whole visual organization. Segmentation thus should be treated not as an end task to be mastered through supervised learning, but as an internal process that evolves with and supports the ultimate goal of recognition. We propose to integrate a hierarchical segmenter into the recognition process, train and adapt the entire model solely on image-level recognition objectives. We learn hierarchical segmentation for free alongside recognition, automatically uncovering part-to-whole relationships that not only underpin but also enhance recognition. Enhancing the Vision Transformer (ViT) with adaptive segment tokens and graph pooling, our model surpasses ViT in unsupervised part-whole discovery, semantic segmentation, image classification, and efficiency. Notably, our model (trained on unlabeled 1M ImageNet images) outperforms SAM (trained on 11M images and 1 billion masks) by absolute 8% in mIoU on PartImageNet object segmentation.

Paper Structure (40 sections, 4 equations, 24 figures, 12 tables, 2 algorithms)

This paper contains 40 sections, 4 equations, 24 figures, 12 tables, 2 algorithms.

Figures (24)

  • Figure 1: Our insight is that image segmentation and recognition form a visual parsing continuum, and their consistency is more essential than individual text labels for recognition. We may recognize this image as ink, girl, or woman. While the foreground (colored areas) may vary, it always has a consistent hierarchical segmentation: three individual blobs when no person is recognized, or parts (face, hair) of the person recognized as girl or woman. Instead of treating segmentation and recognition as separate tasks, we model them concurrently by including segmentation in the loop for recognition. With recognition objectives solely at the image level, not only can hierarchical segmentation be learned for free, but better and substantiated recognition also arises from such internal part-to-whole consistency.
  • Figure 2: While prior work uses patches as visual units and treats segmentation and recognition as separate, supervised tasks with distinct models and data, our work uses superpixels as visual units and integrates hierarchical segmentation into the recognition process, learning it internally from a single recognition objective. Classifiers like ViT (Dosovitskiy et al., 2020) learn recognition from image-level labels. Semantic Segmenters such as Segmenter (Strudel et al., 2021) learn object segments from pixel-level class labels but lack part-whole granularity. Boundary Segmenters like SAM (Kirillov et al., 2023) learn regions of multiple granularities from boundary labels without hierarchical organization. In contrast, our Segmenter for Recognition (CAST) integrates a fine-to-coarse segment hierarchy directly into the recognition process. By graph-pooling over segment tokens, it effectively solves all three tasks concurrently within a visual parsing continuum.
  • Figure 3: Our model performs segmentation and recognition simultaneously during test-time adaptation: initial predictions in a feed-forward hierarchy capture vision at a glance, whereas enhancements in a reverse hierarchy capture vision with scrutiny. It processes an image of a dog, human, and car in a feed-forward hierarchy, initially recognizing the dog with $54\%$ activation based on only the back of the dog. After backpropagating to increase dog activation, the model undergoes test-time adaptation in a reverse hierarchy. This adjustment allows the next feed-forward process to uncover the whole dog and boost dog activation to $97\%$! Our segmentation and recognition thus mutually influence and enhance each other.
  • Figure 4: Our model implements our concept of concurrency and consistency in visual parsing by innovating ViT with adaptive segment tokens and progressive graph pooling. It starts with superpixels instead of square patches, and applies graph pooling to merge fine segments $S_{l-1}$ into coarse segments $S_l$. Both segment transition probability $P_l$ and segment feature $Z_l$ are learned to optimize an image-level recognition objective, which could be self-supervised instance discrimination or supervised image classification. Without any external supervision, we uncover object wholes (dog) along with small details (ears) and thin structures (legs), validating the effectiveness of our concept.
  • Figure 5: CAST uncovers objects with complex contours, due to the use of not only superpixels but also progressive token pooling. We train ViT and CAST on unlabeled ImageNet data using the MoCo objective (He et al., 2020). For the ImageNet image in Column 1 of each row, Columns 2-9 show respectively its square patches used by ViT, 32-, 16-, and 8-way segmentations derived from ViT tokens via fine-to-coarse K-means clustering, superpixels used by CAST, and 32-, 16-, and 8-way segmentations generated by CAST. Our color scheme has coarse-to-fine consistency: colors in 8-way segmentations are matched between ViT and CAST, while colors in 16-way (32-way) segmentations have the same hues as 8-way but vary in saturation (value) to reflect finer details. Our results more closely follow visual contours and successfully uncover entire objects with details like neck, thin legs, and long ears.
  • ...and 19 more figures
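
The graph pooling step described in the Figure 4 caption (merging fine segments $S_{l-1}$ into coarse segments $S_l$ through a transition probability $P_l$ and pooled features $Z_l$) can be illustrated with a small sketch. This is not the authors' implementation. It assumes that coarse centroids are chosen from the fine segments by index, that $P_l$ is a softmax over cosine similarities, and that $Z_l$ is an assignment-weighted average. The names `graph_pool`, `compose_labels`, `centroid_ids`, and `temperature` are hypothetical and exist only for this example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def graph_pool(fine_feats, centroid_ids, temperature=0.1):
    """Merge fine segment features into len(centroid_ids) coarse segments.

    Assumption: centroids are picked by index and P is a softmax over
    cosine similarities. Returns (P, Z), where P[i][j] is the soft
    probability that fine segment i belongs to coarse segment j and
    Z[j] is the pooled feature of coarse segment j.
    """
    centroids = [fine_feats[i] for i in centroid_ids]
    P = [softmax([cosine(f, c) / temperature for c in centroids])
         for f in fine_feats]
    n, dim = len(fine_feats), len(fine_feats[0])
    Z = []
    for j in range(len(centroids)):
        weight = sum(P[i][j] for i in range(n))
        Z.append([sum(P[i][j] * fine_feats[i][d] for i in range(n)) / weight
                  for d in range(dim)])
    return P, Z

def compose_labels(pixel_to_fine, P):
    # Hard-assign each fine segment to its most likely coarse segment,
    # then relabel every pixel through its fine segment index.
    hard = [max(range(len(row)), key=row.__getitem__) for row in P]
    return [hard[s] for s in pixel_to_fine]

# Toy input: four fine segments with 2-D features, six pixels.
fine_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
P, Z = graph_pool(fine_feats, centroid_ids=[0, 2])
coarse = compose_labels([0, 0, 1, 2, 3, 3], P)
print(coarse)
```

Applying `graph_pool` repeatedly to its own output `Z` would give a fine-to-coarse hierarchy in the spirit of the 32-, 16-, and 8-way segmentations in Figure 5. In the paper, both $P_l$ and $Z_l$ are learned from the image-level recognition objective, which this fixed-similarity sketch does not model.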