BYOCL: Build Your Own Consistent Latent with Hierarchical Representative Latent Clustering

Jiayue Dai, Yunya Wang, Yihan Fang, Yuetong Chen, Butian Xiong

TL;DR

The paper tackles semantic inconsistency in single-image segmentation models like SAM when applied to image sequences. It introduces BYOCL, a zero-shot, plug-and-play pipeline that builds hierarchical representative latent clusters using the SAM image encoder, with intra-batch PCA+K-means and inter-batch refinement to enforce cross-image consistency. On DAVIS and MOSE benchmarks, BYOCL outperforms SAM in segmentation accuracy metrics and dramatically reduces computation time, while also enabling consistent segmentation without model training. The approach demonstrates the viability of hierarchical latent clustering for foundation-model-based segmentation and suggests avenues for improved multi-object segmentation.

Abstract

To address the semantic inconsistency that arises when SAM or other single-image segmentation models handle image sequences, we introduce BYOCL. This novel model outperforms SAM in extensive experiments, showcasing its hierarchical prototype capabilities across CLIP and other representations. BYOCL significantly reduces time and space consumption by dividing inputs into smaller batches, achieving exponential time reduction compared to previous methods. Our approach leverages the SAM image encoder for feature extraction, followed by Intra-Batch and Inter-Batch clustering algorithms. Extensive experiments demonstrate that BYOCL far exceeds the previous state-of-the-art single-image segmentation model. Our work is the first to apply consistent segmentation using foundation models without requiring training, utilizing plug-and-play modules for any latent space, which makes our method highly efficient. Models are available at https://github.com/cyt1202/BYOCL.git

Paper Structure

This paper contains 18 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Images (a) and (c) are real-world scenes captured in sequence. Images (b) and (d) are the corresponding SAM segmentation results, which exhibit the inconsistency problem. Images (e), (f), and (g) are inconsistent SAM segmentation results on the grocery-store dataset.
  • Figure 2: Building on the SAM image encoder, our method (BYOCL) adds intra-batch and inter-batch clustering algorithms. After the decoder, we obtain segmented images that are semantically consistent. As the figure shows, the results are noticeably more semantically consistent than those of SAM.
  • Figure 3: A detailed description of our method. An input sequence of images is grouped into batches of size 4. Following a coarse-to-fine logic, we first apply Intra-Batch Processing, which consists of a SAM encoder, PCA downsampling, K-means clustering, and prototyping. The SAM encoder extracts image features, PCA reduces the dimensionality of those features, and K-means clusters the reduced feature vectors. After extracting the prototype, which is the cardinal feature vector of each group, we apply Inter-Batch Processing, which feeds the prototypes into PCA and K-means clustering. The output is shown in Figure 5.
  • Figure 4: An example outcome of the intra-batch clustering step. The segmentation results within a batch are consistent.
  • Figure 5: An example outcome of the inter-batch clustering step. The segmentation results for different photos of the same scene demonstrate the consistency of BYOCL.
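
The two-stage clustering described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random arrays stand in for SAM encoder features, the "cardinal feature vector" of each cluster is assumed to be the cluster mean, and the function names (`pca_reduce`, `kmeans`, `intra_batch`, `byocl_sketch`) and the parameters `k_intra`, `k_inter`, and `n_components` are hypothetical.

```python
import numpy as np

def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one integer label per row of x."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

def intra_batch(features, k, n_components):
    """Cluster one batch; return local labels and one prototype per cluster."""
    labels = kmeans(pca_reduce(features, n_components), k)
    used = np.unique(labels)  # skip clusters that ended up empty
    # Assumption: the prototype is the mean of the original features.
    prototypes = np.stack([features[labels == j].mean(axis=0) for j in used])
    return np.searchsorted(used, labels), prototypes

def byocl_sketch(batches, k_intra, k_inter, n_components):
    """Intra-batch clustering, then inter-batch clustering of the prototypes."""
    local_labels, protos = [], []
    for feats in batches:
        labels, p = intra_batch(feats, k_intra, n_components)
        local_labels.append(labels)
        protos.append(p)
    all_protos = np.concatenate(protos)
    proto_to_global = kmeans(pca_reduce(all_protos, n_components), k_inter)
    out, offset = [], 0
    for labels, p in zip(local_labels, protos):
        # Map each local cluster id to the global id of its prototype.
        out.append(proto_to_global[offset + labels])
        offset += len(p)
    return out

# Synthetic stand-in for encoder features: 3 batches, 120 vectors each, 16-D.
rng = np.random.default_rng(0)
blob_centers = rng.normal(scale=10.0, size=(3, 16))
batches = [
    np.concatenate([c + rng.normal(size=(40, 16)) for c in blob_centers])
    for _ in range(3)
]
labels = byocl_sketch(batches, k_intra=4, k_inter=3, n_components=3)
print([np.bincount(l, minlength=3) for l in labels])
```

Because every feature vector ends up with a label drawn from one shared set of inter-batch clusters, the same label can be compared across batches, which is the sense in which the latent is made consistent. In a real pipeline the synthetic arrays would be replaced by per-patch embeddings from the SAM image encoder.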