What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

Jianghang Lin, Yue Hu, Jiangtao Shen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

TL;DR

This work tackles open vocabulary image segmentation by addressing semantic misalignment between region proposals and target concepts. It introduces a Cognition-Inspired Framework that first generates object concepts with a Generative Vision-Language Model (G-VLM), then enhances global visual representations through a Concept-Aware Visual Enhancer (CAVE), and finally decodes masks with a Cognition-Inspired Decoder (CID) under two inference modes. The approach yields state-of-the-art or competitive results across multiple benchmarks, including strong vocabulary-free performance and cross-domain robustness. By emulating Conceive-before-Perceive reasoning and grounding segmentation in semantic concepts, the framework supports open vocabulary segmentation without relying on predefined vocabularies, with practical implications for scalable real-world scene understanding.

Abstract

Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching $27.2$ PQ, $17.0$ mAP, and $35.3$ mIoU on A-150. It further attains $56.2$, $28.2$, $15.4$, $59.2$, $18.7$, and $95.8$ mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.

Paper Structure

This paper contains 18 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) shows the traditional open vocabulary segmentation framework, which relies on human-predefined categories; (b) shows our cognition-inspired open vocabulary segmentation framework, which introduces a Generative Vision-Language Model (G-VLM) to generate concepts and achieves generalized segmentation capabilities through enhancement modules.
  • Figure 2: In the traditional open vocabulary image segmentation paradigm, the mask embeddings of objects are forced to approximate the few seen category embeddings in the training set, which makes it difficult for the model to accurately distinguish the large number of novel categories at inference (all categories in the test set). Simply restricting the candidate categories for each region to the subset of ground truth (GT) categories that appear in the image significantly improves the segmentation performance of the model.
  • Figure 3: Overview of our Cognition-Inspired Framework. The framework follows the process of human visual recognition, first Conceiving, then Perceiving, which can be summarized as What You Perceive Is What You Conceive. It first uses a simple prompt $\mathbf{P}$ to prompt the G-VLM to generate the concepts $\mathcal{T}$ with confidence $\mathcal{C}$. The D-VLM Text Encoder encodes $\mathcal{T}$ into category embeddings $\mathcal{E}_c$. Then $\mathcal{E}_c$ interacts with the global visual features $\mathcal{V}_g$, encoded by the D-VLM Vision Encoder, in the Concept-Aware Visual Enhancer so that $\mathcal{V}_g$ can converge according to semantic concepts. In the Cognition-Inspired Mask Decoder, $\mathbf{K}$ queries are first fused with $\mathcal{E}_c$ to integrate semantic concepts, and then interact with the enhanced global visual features $\mathcal{V}_{eg}$ to query the objects in the image and generate the corresponding masks $\mathbf{M}$. Mask Pooling over $\mathbf{M}$ yields the local visual mask embeddings $\mathcal{E}_m$, whose cosine similarity with $\mathcal{E}_c$, weighted by $\mathcal{C}$, gives the final category prediction.
  • Figure 4: Comparison of precision and recall across different datasets for four vision-language models (BLIP2 [li2023blip], Llava-NeXT [li2024llavanext-ablations], Qwen2.5-VL [bai2025qwen2], and RAM [zhang2023recognize]).
  • Figure 5: Visualization of K-means clustering of $\mathcal{V}_g$, of $\mathcal{V}_{sa}$ without our Concept-Aware Visual Enhancer (replaced by the Pixel Decoder [cheng2022masked]), and of $\mathcal{V}_{sa}$ with our Concept-Aware Visual Enhancer.
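
The classification step described in the Figure 3 caption (mask pooling, cosine similarity against the concept embeddings, confidence weighting) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names `mask_pool`, `cosine`, and `classify_mask`, the toy 2-D features, and the choice to apply the confidence as a simple multiplicative weight are all assumptions made for illustration; the caption states only that the similarity is "weighted by" $\mathcal{C}$.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors of the pixels selected by a binary mask.

    features: list of per-pixel feature vectors (image flattened to H*W).
    mask: list of 0/1 values with the same length as features.
    """
    dim = len(features[0])
    total = [0.0] * dim
    count = 0
    for vec, selected in zip(features, mask):
        if selected:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        return total  # empty mask: return a zero embedding
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_mask(features, mask, concept_embeddings, confidences):
    """Score one mask against the generated concepts only.

    ASSUMPTION: the concept confidence multiplies the cosine similarity.
    """
    mask_embedding = mask_pool(features, mask)  # plays the role of E_m
    scores = {
        name: cosine(mask_embedding, emb) * confidences[name]
        for name, emb in concept_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: 4 pixels with 2-D features; the mask covers the first two.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask = [1, 1, 0, 0]
concept_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}  # stands in for E_c
confidences = {"cat": 0.9, "dog": 0.8}  # stands in for C

best, scores = classify_mask(features, mask, concept_embeddings, confidences)
print(best, scores)
```

Because the candidate set holds only the concepts generated for this image, not the full test vocabulary, the sketch also mirrors the observation in the Figure 2 caption that a smaller, image-relevant category subset makes the matching easier.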