CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation

Prantik Howlader, Hoang Nguyen-Canh, Srijan Das, Jingyi Xu, Hieu Le, Dimitris Samaras

TL;DR

Reasoning segmentation seeks pixel-accurate masks for targets described by open-ended language, but obtaining diverse, high-quality supervision is costly. The paper introduces CORA, a semi-supervised framework that combines conditional visual instructions, output-consistency-driven pseudo-label refinement, and token-level feature alignment to learn from limited labels and large unlabeled datasets. On Cityscapes and PanNuke, CORA achieves state-of-the-art results in low-label regimes, demonstrating robustness to distribution shift and to pseudo-label noise. This approach reduces the annotation burden while enabling reliable reasoning-based segmentation in real-world domains such as autonomous driving and medical imaging.

Abstract

Reasoning segmentation seeks pixel-accurate masks for targets referenced by complex, often implicit instructions, requiring context-dependent reasoning over the scene. Recent multimodal language models have advanced instruction-following segmentation, yet generalization remains limited. The key bottleneck is the high cost of curating diverse, high-quality pixel annotations paired with rich linguistic supervision, which leads to brittle performance under distribution shift. Therefore, we present CORA, a semi-supervised reasoning segmentation framework that jointly learns from limited labeled data and a large corpus of unlabeled images. CORA introduces three main components: 1) conditional visual instructions that encode spatial and contextual relationships between objects; 2) a noisy pseudo-label filter based on the consistency of a multimodal LLM's outputs across semantically equivalent queries; and 3) a token-level contrastive alignment between labeled and pseudo-labeled samples to enhance feature consistency. These components enable CORA to perform robust reasoning segmentation with minimal supervision, outperforming existing baselines under constrained annotation settings. CORA achieves state-of-the-art results with as few as 100 labeled images on Cityscapes, a benchmark dataset for urban scene understanding, surpassing the baseline by $+2.3\%$. Similarly, CORA improves performance by $+2.4\%$ with only 180 labeled images on PanNuke, a histopathology dataset.
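The second component, filtering pseudo-labels by output consistency across semantically equivalent queries, can be illustrated with a minimal sketch. The abstract does not state the exact agreement measure or acceptance rule, so the mean pairwise IoU, the `threshold` value of 0.7, and all function names below are assumptions chosen for illustration, not the paper's implementation.

```python
from itertools import combinations
from typing import List, Sequence

# A binary mask as rows of 0/1 values (hypothetical representation).
Mask = Sequence[Sequence[int]]


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as fully agreeing.
    return inter / union if union else 1.0


def consistency_score(masks: List[Mask]) -> float:
    """Mean pairwise IoU over masks predicted for paraphrased queries."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        return 1.0
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


def keep_pseudo_label(masks: List[Mask], threshold: float = 0.7) -> bool:
    """Accept a pseudo-label only if the paraphrase outputs agree enough.

    The threshold is an assumed hyperparameter, not a value from the paper.
    """
    return consistency_score(masks) >= threshold


if __name__ == "__main__":
    # Masks a model might return for three rephrasings of one instruction.
    m1 = [[1, 1], [0, 0]]
    m2 = [[1, 1], [0, 0]]
    m3 = [[1, 0], [0, 0]]
    print(consistency_score([m1, m2, m3]), keep_pseudo_label([m1, m2, m3]))
```

In this sketch, a pseudo-label whose paraphrased queries yield diverging masks gets a low score and is dropped from the unlabeled training signal; a softer variant could instead down-weight its loss.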

Paper Structure

This paper contains 21 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Training reasoning segmentation systems with semi-supervised semantic segmentation supervision. The second row shows results from LISA [lai2024lisa] trained on 100 labeled images, while the third row shows results from CORA (Ours) trained on 100 labeled images and 2,875 unlabeled images. Unlike LISA [lai2024lisa], which is limited to fully supervised settings, our method effectively leverages 2,875 unlabeled images alongside just 100 labeled examples, demonstrating that unlabeled data can significantly enhance reasoning-based segmentation performance.
  • Figure 2: Illustration of the Conditional-relationship Visual Instruction Set used for training: target object segmentation conditioned on its contextual relationship with the reference object (Anchor).
  • Figure 3: Framework of our approach leveraging unlabeled images for reasoning segmentation. CORA is trained on unlabeled images using pseudo-labels from a pretrained semi-supervised segmentation model, with output consistency from a multi-modal LLM used to reduce pseudo-label noise.
  • Figure 4: Token-level Feature Consistency Alignment. The objective minimizes the distance between the unlabeled token feature (<SSEG$u$$c$>) and the same-class labeled token (<SSEG$l$$c$>), while maximizing its distance from different-class labeled tokens (<SSEG$l$$b$>).
  • Figure 5: Pipeline for generating our dataset for training CORA. Given an image, two random objects are selected as target and anchor from the segmentation mask. Descriptions of each are generated using the image and their mask polygons. These textual descriptions, together with the polygons, are then used to create the segmentation instruction set. The system prompt used to generate the conditional visual instruction is in the Supplementary (Section 7.2).
  • ...and 2 more figures
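
The alignment described in the Figure 4 caption (pull an unlabeled token toward same-class labeled tokens, push it away from different-class labeled tokens) can be sketched as an InfoNCE-style contrastive loss. The captions do not give the paper's exact loss, so the cosine similarity, the temperature `tau`, the class names, and the function names below are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence

Vec = Sequence[float]


def cosine(a: Vec, b: Vec, eps: float = 1e-12) -> float:
    """Cosine similarity, with eps guarding against zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def token_alignment_loss(
    unlabeled: Vec,
    cls: str,
    labeled: Dict[str, List[Vec]],
    tau: float = 0.1,  # assumed temperature, not a value from the paper
) -> float:
    """InfoNCE-style loss for one pseudo-labeled token feature.

    Positives: labeled token features of the same class `cls`.
    Negatives: labeled token features of every other class.
    """
    pos = sum(math.exp(cosine(unlabeled, f) / tau) for f in labeled[cls])
    neg = sum(
        math.exp(cosine(unlabeled, f) / tau)
        for other, feats in labeled.items()
        if other != cls
        for f in feats
    )
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    # Hypothetical 2-D token features for two classes.
    bank = {"car": [[1.0, 0.0]], "road": [[0.0, 1.0]]}
    print(token_alignment_loss([0.9, 0.1], "car", bank))  # near its class
    print(token_alignment_loss([0.1, 0.9], "car", bank))  # near the wrong class
```

The loss is small when the pseudo-labeled token already lies close to labeled tokens of its class and grows when it lies closer to another class, which matches the minimize/maximize behavior the caption describes.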