MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation

Kaixin Cai, Pengzhen Ren, Yi Zhu, Hang Xu, Jianzhuang Liu, Changlin Li, Guangrun Wang, Xiaodan Liang

TL;DR

This work tackles open-world semantic segmentation by leveraging image-text supervision to achieve fine-grained pixel-text alignment. It introduces MixReorg, a cross-modal mixed patch reorganization framework that constructs patch-text data from image-text pairs via contextual and progressive mixing, plus a mixing restoration stage. The training objective combines a mixed-image segmentation loss with a restoration-based, cross-modal contrastive loss and a cross-modal original-image-to-text loss, yielding strong zero-shot performance on PASCAL VOC, PASCAL Context, COCO, and ADE20K. Empirical results show substantial improvements over GroupViT and other zero-shot baselines, with ablations confirming the importance of contextual mixing and the two loss terms. This approach provides a scalable path toward dense pixel-semantic alignment in open-world scenarios using only text supervision.

Abstract

Recently, semantic segmentation models trained with image-level text supervision have shown promising results in challenging open-world scenarios. However, these models still face difficulties in learning fine-grained semantic alignment at the pixel level and predicting accurate object masks. To address this issue, we propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation that enhances a model's ability to reorganize patches mixed across images, exploring both local visual relevance and global semantic coherence. Our approach generates fine-grained patch-text pair data by mixing image patches while preserving the correspondence between patches and text. The model is then trained to minimize the segmentation loss of the mixed images and the two contrastive losses of the original and restored features. With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment, which is crucial for open-world segmentation. After training on large-scale image-text data, MixReorg models can be applied directly to segment visual objects of arbitrary categories, without further fine-tuning. Our proposed framework demonstrates strong performance on popular zero-shot semantic segmentation benchmarks, outperforming GroupViT by significant margins of 5.0%, 6.2%, 2.5%, and 3.4% mIoU on PASCAL VOC2012, PASCAL Context, MS COCO, and ADE20K, respectively.
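
The patch-mixing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mix_patches`, the position-wise shuffling rule, and the use of plain Python lists as stand-ins for patch features are all assumptions made for illustration. The point it shows is that mixing patches across images yields, at no extra cost, a mask recording which image (and therefore which caption) each patch came from.

```python
import random

def mix_patches(images, seed=0):
    """Mix patches across M images, position by position (hypothetical helper).

    images: list of M images, each a list of N patches (any objects).
    Returns (mixed, masks): mixed[k][i] is the patch at position i of mixed
    image k, and masks[k][i] is the index of the source image (and of its
    paired caption) that the patch was taken from.
    """
    rng = random.Random(seed)
    m = len(images)
    n = len(images[0])
    if any(len(img) != n for img in images):
        raise ValueError("all images must have the same number of patches")
    mixed = [[None] * n for _ in range(m)]
    masks = [[0] * n for _ in range(m)]
    for i in range(n):
        # Shuffle the source images at this patch position, so every original
        # patch is used exactly once across the M mixed images.
        perm = list(range(m))
        rng.shuffle(perm)
        for k, src in enumerate(perm):
            mixed[k][i] = images[src][i]
            masks[k][i] = src
    return mixed, masks

# Toy usage: two "images" with four labelled patches each.
images = [[f"img{j}_patch{i}" for i in range(4)] for j in range(2)]
mixed, masks = mix_patches(images)
```

Because the mask is a by-product of the mixing step, it can serve as a segmentation target for the mixed image without any manual annotation.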

Paper Structure (13 sections, 14 equations, 7 figures, 6 tables)

This paper contains 13 sections, 14 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison between GroupViT (Xu et al., 2022) and MixReorg. (a) GroupViT obtains image segmentation implicitly from image-text pairs to achieve cross-modal semantic alignment. (b) MixReorg explicitly constructs fine-grained patch-text pair data from the image-text pairs for free by mixing patches from different images while preserving the correspondence between patches and text.
  • Figure 2: Visual comparison between MixReorg and GroupViT (Xu et al., 2022) on images from the Internet. Our method handles open-world classes better in the segmentation task.
  • Figure 3: The training pipeline and framework of MixReorg (taking two images as an example). MixReorg's image encoder can be divided into three stages: (a) contextual mixing stage: a set of additional patch-text pairs with known segmentation masks is obtained by randomly mixing contextual patches from different images; (b) progressive mixing stage: the original image features are used to enhance the global information of the mixed image features after mixing; (c) mixing restoration stage: the original features, mixed features, and restored features are segmented through a two-stage grouping block (Xu et al., 2022), and the corresponding segment tokens are obtained. Note that we omit group tokens in the forward process for simplicity. During testing, MixReorg only needs to execute the original image branch.
  • Figure 4: Cross-modal mixed patch reorganization combines attention maps and segmentation tokens from the image encoder with text embeddings to reorganize and predict segmentation masks for mixed images. Here, $B_I=M$ means that every $B_I$ images are mixed together. For simplicity, we take a mixed image generated by mixing two images as an example.
  • Figure 5: Ablation study of MixReorg on COCO over the number of progressive mixing modules and the number of images used in the contextual mixing operation. (a) Yellow line: Ablation study on the number $P$ of progressive mixing modules. We replace one progressive mixing module with one transformer layer to maintain the model size. (b) Red line: Ablation study on the number $M$ of images for each contextual mixing operation.
  • ...and 2 more figures
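
The captions of Figures 3 and 4 describe predicting a mask for a mixed image from image-side tokens and text embeddings, then comparing it with the mask known from mixing. The sketch below is a deliberately simplified illustration of that idea, not the paper's architecture: it scores each patch feature directly against each text embedding by cosine similarity, whereas the paper routes this through attention maps and segment tokens. The function names, the temperature value of 0.07, and the toy vectors are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def predict_mask(patch_feats, text_embs, temperature=0.07):
    """Assign each patch to its most similar text (hypothetical helper).

    Returns (probs, mask): probs[i] is a softmax over texts for patch i,
    mask[i] is the index of the highest-scoring text.
    """
    texts = [normalize(t) for t in text_embs]
    probs, mask = [], []
    for p in patch_feats:
        p = normalize(p)
        logits = [dot(p, t) / temperature for t in texts]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]  # numerically stable softmax
        total = sum(exps)
        row = [e / total for e in exps]
        probs.append(row)
        mask.append(max(range(len(row)), key=row.__getitem__))
    return probs, mask

def mask_cross_entropy(probs, target_mask, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and the mixing mask."""
    return -sum(math.log(row[t] + eps)
                for row, t in zip(probs, target_mask)) / len(target_mask)

# Toy usage: three patches, two captions, 2-D features.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
probs, mask = predict_mask(patch_feats, text_embs)
loss_correct = mask_cross_entropy(probs, [0, 1, 0])
loss_wrong = mask_cross_entropy(probs, [1, 0, 1])
```

In this toy setting the loss is lower when the target matches the source of each patch, which is the signal a mixed-image segmentation loss would exploit.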