v-CLR: View-Consistent Learning for Open-World Instance Segmentation

Chang-Bin Zhang, Jinhong Ni, Yujie Zhong, Kai Han

TL;DR

This work tackles open-world instance segmentation by addressing texture-driven appearance bias. It introduces view-Consistent LeaRning (v-CLR), a DETR-like framework with two branches (natural and transformed views) and an EMA teacher, trained with a cross-view matching loss $L_{match}$ and a segmentation loss $L_{gt}$. CutLER object proposals keep the supervision object-centric, and a cosine-based appearance-consistency loss $L_{sim}$ aligns object features across the natural and transformed views. The transformed views come from colorized-depth and auxiliary transformations that alter texture while preserving structure, compelling the model to learn robust, object-centered features that transfer to novel object categories. Experiments on cross-category and cross-dataset benchmarks show state-of-the-art performance, improved generalization to unseen textures, and strong qualitative results in discovering unknown objects. Overall, v-CLR offers a principled, scalable path toward robust open-world instance segmentation by decoupling appearance from object identity and grounding learning with reliable object proposals.

Abstract

In this paper, we address the challenging problem of open-world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, e.g., texture, to recognize objects. This implicit bias causes the model to fail to detect novel objects with unseen textures in the open-world setting. To address this challenge, we propose a learning framework, called view-Consistent LeaRning (v-CLR), which encourages the model to learn appearance-invariant representations for robust instance segmentation. In v-CLR, we first introduce additional views for each image, in which the texture is significantly altered while the image's underlying structure is preserved. We then encourage the model to learn appearance-invariant representations by enforcing consistency between object features across different views. To this end, we obtain class-agnostic object proposals using off-the-shelf unsupervised models that possess strong object-awareness. These proposals enable cross-view object feature matching, greatly reducing the appearance dependency while enhancing object-awareness. We thoroughly evaluate our method on public benchmarks under both cross-class and cross-dataset settings, achieving state-of-the-art performance. Project page: https://visual-ai.github.io/vclr
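
The cross-view consistency idea in the abstract can be illustrated with a small sketch. It assumes that object features from two views have already been paired (one feature vector per matched object) and that consistency is measured as one minus cosine similarity, averaged over pairs. The function names `cosine_similarity` and `consistency_loss` and the exact form of the loss are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def consistency_loss(view_a: Sequence[Vector], view_b: Sequence[Vector]) -> float:
    """Mean (1 - cosine similarity) over already-matched feature pairs.

    Assumption: view_a[i] and view_b[i] describe the same object seen in
    two different views (e.g. natural and colorized depth).
    """
    if len(view_a) != len(view_b):
        raise ValueError("both views must contain the same number of matched features")
    if not view_a:
        return 0.0  # no matched objects, nothing to align
    losses = [1.0 - cosine_similarity(a, b) for a, b in zip(view_a, view_b)]
    return sum(losses) / len(losses)
```

Under this formulation the loss is zero when paired features point in the same direction and grows as they diverge, so minimizing it pushes the model toward features that do not change when only the texture of the input changes.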

Paper Structure

This paper contains 18 sections, 4 equations, 8 figures, 12 tables, 1 algorithm.

Figures (8)

  • Figure 1: Toy example on the CLEVR dataset. The model regards red-metal objects as the known class and is evaluated on different subsets in terms of AR@10. We train the model with and without incorporating depth image data, respectively. The prediction results are displayed in the middle row.
  • Figure 2: Illustration of v-CLR. Our learning framework consists of two branches, the natural image branch (top) and the transformed image branch (bottom). Both branches adopt transformers to make predictions, which are then matched with the object proposals to obtain optimized object queries. We compute a matching loss $L_{match}$ which enforces the matched object-oriented query pairs from the two branches to be similar. We finally compute the ordinary segmentation loss $L_{gt}$ using the ground truth labels. The transformer in the natural image branch is updated as an EMA model of the transformed image branch.
  • Figure 3: Illustration of object feature matching in v-CLR. Let $\mathcal{Q}_1$ and $\mathcal{Q}_2$ represent the query outputs from the EMA teacher model and the student model, respectively. Predictions associated with object proposals demonstrating poor localization quality are removed, resulting in paired $\hat{\mathcal{Q}}_1$ and $\hat{\mathcal{Q}}_2$, and the objective $L_{sim}$ is utilized to maximize feature similarity between each pair. Concurrently, the student model is trained using these object proposals.
  • Figure 4: Qualitative results of our method on COCO 2017 validation set. The model is trained on 20 VOC classes. We show the top-10 predicted instances according to the prediction confidence.
  • Figure 5: Visualization of three views used in our method, natural, art-stylized, and colorized depth images, respectively.
  • ...and 3 more figures
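
Two mechanisms named in the captions of Figures 2 and 3 can be sketched in a few lines: the EMA update that derives the teacher from the student, and the removal of query pairs whose proposal match is poorly localized. The sketch below is only an illustration under stated assumptions. Parameters are modeled as a plain dictionary of floats, the momentum value 0.999 and the IoU threshold 0.5 are placeholders, and the names `ema_update`, `ProposalMatch`, and `keep_well_localized` are hypothetical. The captions do not state the exact filtering criterion, so requiring both branches to exceed the IoU threshold is an assumption.

```python
from typing import Dict, List, NamedTuple, Tuple


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    momentum: float = 0.999,  # placeholder value, not taken from the paper
) -> Dict[str, float]:
    """Return teacher parameters moved slightly toward the student's."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


class ProposalMatch(NamedTuple):
    """One object proposal matched to a query in each branch (hypothetical)."""
    teacher_query: int   # index into the teacher's query outputs (Q_1)
    student_query: int   # index into the student's query outputs (Q_2)
    teacher_iou: float   # localization quality of the teacher's prediction
    student_iou: float   # localization quality of the student's prediction


def keep_well_localized(
    matches: List[ProposalMatch],
    iou_threshold: float = 0.5,  # placeholder value, not taken from the paper
) -> List[Tuple[int, int]]:
    """Keep query index pairs whose predictions both localize the proposal well."""
    return [
        (m.teacher_query, m.student_query)
        for m in matches
        if m.teacher_iou >= iou_threshold and m.student_iou >= iou_threshold
    ]
```

The surviving index pairs are the ones on which a similarity objective such as $L_{sim}$ would be evaluated. Because the teacher is never updated by gradients in this sketch, it changes slowly and gives the student a stable target.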