Spider: A Unified Framework for Context-dependent Concept Segmentation

Xiaoqi Zhao, Youwei Pang, Wei Ji, Baicheng Sheng, Jiaming Zuo, Lihe Zhang, Huchuan Lu

TL;DR

Spider introduces a unified, parameter-sharing framework for context-dependent concept segmentation across eight tasks by leveraging image-mask group prompts to form a high-level concept filter that modulates a shared segmentation backbone. The model combines a segmentation stream with prompt streams and a dynamic head, enabling cross-domain CD understanding without task-specific heads. A Balance FP - Unify BP training strategy and clustering-based prompt selection during inference support robust learning and continual adaptation, with fine-tuning on new tasks using less than 1% of parameters and minimal degradation on old tasks. Empirical results show state-of-the-art performance across natural and medical CD segmentation tasks, highlighting the approach's potential for cross-domain CD concept understanding and future extensions to video and editing applications.

Abstract

Unlike context-independent (CI) concepts such as human, car, and airplane, context-dependent (CD) concepts such as camouflaged objects and medical lesions require a higher level of visual understanding. Despite rapid advances in many CD understanding tasks within their respective branches, this isolated evolution has led to limited cross-domain generalisation and repetitive technique innovation. Because foreground and background context are strongly coupled in CD tasks, existing methods need to train separate models for their focused domains. This restricts real-world CD concept understanding on the path towards artificial general intelligence (AGI). We propose Spider, a unified model with a single set of parameters that only needs to be trained once. With the help of the proposed concept filter driven by the image-mask group prompt, Spider is able to understand and distinguish diverse strongly context-dependent concepts to accurately capture the prompter's intention. Without bells and whistles, Spider significantly outperforms state-of-the-art specialized models on 8 different context-dependent segmentation tasks, including 4 natural-scene tasks (salient, camouflaged, and transparent objects, and shadow) and 4 medical lesion tasks (COVID-19, polyp, breast, and skin lesions with CT, color colonoscopy, ultrasound, and dermoscopy modalities, respectively). Spider also shows clear advantages in continual learning: it can complete training on a new task by fine-tuning less than 1% of its parameters, with a tolerable performance degradation of less than 5% on all old tasks. The source code will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/Spider-UniCDSeg (Spider-UniCDSeg).

Paper Structure

This paper contains 24 sections, 2 equations, 26 figures, 6 tables, 1 algorithm.

Figures (26)

  • Figure 1: Eight different segmentation tasks with context-dependent concepts are unified into our Spider model. With the interlaced concepts within task domains and class semantic space, Spider can wander to any target of interest.
  • Figure 2: Two types of feature interaction between visual prompts and the current image input. The left is used in UniverSeg with multiple foreground prompts and in other methods (SegGPT, SAM, CLIPSeg) with a single foreground prompt. The right is ours.
  • Figure 3: Visual comparison of segmentation objects with context-independent concepts and context-dependent concepts. The odd rows are the pure foregrounds and the even rows are the complete images with the highlighted foregrounds.
  • Figure 4: Overall pipeline. It consists of a segmentation stream $\mathcal{S}_{s}$ and image- and mask-group prompt streams $\mathcal{S}_{i}$ and $\mathcal{S}_{m}$. $\mathcal{S}_{s}$ uses an encoder-decoder structure. $\mathcal{S}_{i}$ is fed into the frozen pre-trained encoder, which outputs the group prompt feature $F_{mem}$ used as the key and value of the transformer decoder. $\mathcal{S}_{m}$ generates the foreground-aware and background-aware queries by masked average pooling on the group prompt feature $F_{mem}$. A series of concept filters $\langle W_{obj}, b_{ctx} \rangle$ act on the last layer of the decoder to generate the dynamic prediction.
  • Figure 5: Illustration of generating concept filters.
  • ...and 21 more figures
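
The Figure 4 caption mentions two operations that are easy to picture in code: masked average pooling over the group prompt features, and a filter pair $\langle W_{obj}, b_{ctx} \rangle$ applied to the last decoder layer. The NumPy sketch below is illustrative only and is not the authors' implementation. It skips the transformer decoder entirely and, as a simplifying assumption, derives the filter directly from the pooled foreground and background prototypes using a nearest-prototype linear rule. The function names (`masked_average_pooling`, `build_filter`, `dynamic_predict`) are hypothetical.

```python
import numpy as np


def masked_average_pooling(feats, masks, eps=1e-6):
    """Average the features over the masked pixels of a prompt group.

    feats: (N, C, H, W) features of N prompt images.
    masks: (N, H, W) binary masks selecting the pixels to pool.
    Returns a (C,) prototype vector.
    """
    masks = masks.astype(feats.dtype)[:, None]        # (N, 1, H, W)
    pooled = (feats * masks).sum(axis=(0, 2, 3))      # sum over group and space
    return pooled / (masks.sum() + eps)


def build_filter(feats, masks):
    """Stand-in for the concept filter (assumption, not the paper's decoder).

    Uses the linear rule that assigns a pixel to the nearer of the
    foreground and background prototypes.
    """
    masks = masks.astype(float)
    fg = masked_average_pooling(feats, masks)         # foreground-aware prototype
    bg = masked_average_pooling(feats, 1.0 - masks)   # background-aware prototype
    w_obj = fg - bg
    b_ctx = -0.5 * (fg @ fg - bg @ bg)
    return w_obj, b_ctx


def dynamic_predict(decoder_feat, w_obj, b_ctx):
    """Apply the filter as a 1x1 convolution on a (C, H, W) feature map."""
    logits = np.einsum("c,chw->hw", w_obj, decoder_feat) + b_ctx
    return 1.0 / (1.0 + np.exp(-logits))              # per-pixel foreground probability
```

In this toy setting, the filter is computed once from the prompt group and then reused on any query feature map with the same channel count, which mirrors the idea of a prompt-driven dynamic head without task-specific heads.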