v-CLR: View-Consistent Learning for Open-World Instance Segmentation
Chang-Bin Zhang, Jinhong Ni, Yujie Zhong, Kai Han
TL;DR
This work tackles open-world instance segmentation by addressing texture-driven appearance bias. It introduces view-Consistent LeaRning (v-CLR), a DETR-like framework with two branches (natural and transformed views) and an EMA teacher. The model is trained with a cross-view matching loss $L_{match}$, a segmentation loss $L_{gt}$, and a cosine-based appearance-consistency loss $L_{sim}$ across transformed views, while CutLER object proposals keep the supervision object-centric. Colorized-depth and auxiliary transformations create appearance-invariant views, compelling the model to learn robust, object-centered features that transfer to novel object categories. Experiments on cross-category and cross-dataset benchmarks demonstrate state-of-the-art performance, improved generalization to unseen textures, and strong qualitative results in discovering unknown objects. Overall, v-CLR offers a principled, scalable path toward robust open-world instance segmentation by decoupling appearance from object identity and grounding learning with reliable object proposals.
Abstract
In this paper, we address the challenging problem of open-world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, e.g., texture, to recognize objects. This implicit bias causes the model to fail to detect novel objects with unseen textures in the open-world setting. To address this challenge, we propose a learning framework, called view-Consistent LeaRning (v-CLR), which encourages the model to learn appearance-invariant representations for robust instance segmentation. In v-CLR, we first introduce additional views for each image, in which the texture is significantly altered while the image's underlying structure is preserved. We then encourage the model to learn appearance-invariant representations by enforcing consistency between object features across the different views. To this end, we obtain class-agnostic object proposals from off-the-shelf unsupervised models that possess strong object-awareness. These proposals enable cross-view object feature matching, greatly reducing the appearance dependency while enhancing object-awareness. We thoroughly evaluate our method on public benchmarks under both cross-class and cross-dataset settings, achieving state-of-the-art performance. Project page: https://visual-ai.github.io/vclr
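To make the idea of cross-view feature consistency concrete, the following is a minimal sketch and not the authors' implementation. It assumes that object features from two views have already been matched pair by pair, that the consistency term takes the form of one minus cosine similarity (the summary above only states that the loss is cosine-based), and that the EMA teacher is a plain exponential moving average of student weights. The function names `consistency_loss` and `ema_update` and the momentum value are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def consistency_loss(
    feats_view_a: Sequence[Sequence[float]],
    feats_view_b: Sequence[Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) over matched object-feature pairs.

    Assumption: feats_view_a[i] and feats_view_b[i] describe the same
    object in two views (e.g. natural image and a texture-altered view).
    """
    if len(feats_view_a) != len(feats_view_b):
        raise ValueError("both views must provide the same number of matched features")
    if not feats_view_a:
        return 0.0
    total = sum(1.0 - cosine_similarity(a, b) for a, b in zip(feats_view_a, feats_view_b))
    return total / len(feats_view_a)


def ema_update(teacher: Sequence[float], student: Sequence[float], momentum: float = 0.999) -> List[float]:
    """Exponential-moving-average update of teacher weights from student weights.

    Assumption: weights are flattened into plain lists; the momentum
    value is illustrative, not taken from the paper.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Two matched objects: the first pair points in the same direction
# (no penalty), the second pair is orthogonal (maximal penalty of 1).
natural_feats = [[1.0, 0.0], [0.0, 2.0]]
transformed_feats = [[2.0, 0.0], [1.0, 0.0]]
loss = consistency_loss(natural_feats, transformed_feats)
```

Because the cosine term ignores vector magnitude, only the direction of each object feature is compared across views, which is one simple way to penalize representations that change when texture changes but structure does not.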
