INT: Instance-Specific Negative Mining for Task-Generic Promptable Segmentation

Jian Hu, Zixu Cheng, Shaogang Gong

TL;DR

The paper tackles the problem of segmenting diverse images with a single task-generic prompt by introducing INT, a training-free test-time adaptation framework that progressively refines instance-specific prompts and semantic masks. It comprises two main components: instance-specific prompt generation, which uses patch-based hallucinations and inpainting-driven contrast to select plausible prompts by measuring changes in VLM outputs, and semantic mask generation, which fuses GroundingDINO, SAM, and Spatial CLIP to produce semantically aligned masks that are iteratively refined and averaged. The approach demonstrates strong performance across six datasets, including camouflaged object detection and medical image segmentation, outperforming several baselines that rely on manual prompts or weaker supervision and highlighting the value of progressive negative mining in reducing erroneous prompts. The work presents a practical, annotation-free strategy for robust promptable segmentation with potential impact on real-world segmentation tasks where labeled data are scarce.

Abstract

Task-generic promptable image segmentation aims to segment diverse samples under a single task description by using only one task-generic prompt. Current methods leverage the generalization capabilities of Vision-Language Models (VLMs) to infer instance-specific prompts from these task-generic prompts in order to guide the segmentation process. However, when VLMs struggle to generalise to some image instances, the predicted instance-specific prompts become poor. To solve this problem, we introduce Instance-specific Negative Mining for Task-Generic Promptable Segmentation (INT). The key idea of INT is to adaptively reduce the influence of irrelevant (negative) prior knowledge while increasing the use of the most plausible prior knowledge, selected by negative mining with higher contrast, in order to optimise instance-specific prompt generation. Specifically, INT consists of two components: (1) instance-specific prompt generation, which progressively filters out incorrect information in prompt generation; (2) semantic mask generation, which ensures that the segmentation of each image instance correctly matches the semantics of the instance-specific prompts. INT is validated on six datasets, including camouflaged objects and medical images, demonstrating its effectiveness, robustness and scalability.
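
The TL;DR above notes that masks are iteratively refined and averaged. The sketch below illustrates only the averaging step, on toy data. The function name `average_masks`, the nested-list mask representation, and the 0.5 binarisation threshold are assumptions made for illustration; the source does not specify how the average is computed or thresholded.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[float]]  # rows of per-pixel foreground scores in [0, 1]


def average_masks(
    masks: Sequence[Mask], threshold: float = 0.5
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Average same-sized masks pixel-wise, then binarise the average.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    height, width = len(masks[0]), len(masks[0][0])
    count = len(masks)
    # Pixel-wise mean over all iterations' masks.
    mean = [
        [sum(m[i][j] for m in masks) / count for j in range(width)]
        for i in range(height)
    ]
    # A pixel is foreground when its mean score reaches the threshold.
    binary = [[value >= threshold for value in row] for row in mean]
    return mean, binary


# Toy 2x2 masks standing in for the outputs of three refinement iterations.
iteration_masks = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[0, 0], [1, 1]],
]
mean_mask, final_mask = average_masks(iteration_masks)
print(final_mask)
```

Averaging before thresholding lets a pixel survive when most iterations agree on it, which is one plausible reason to aggregate across iterations rather than keep only the last mask.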

Paper Structure

This paper contains 9 sections, 13 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (a) Motivation of INT. When task-related objects in the input to the VLM are occluded, the unique features of these objects are also obscured, leading to significant changes in the corresponding VLM output. In contrast, the features of other objects, which are not fully occluded, cause only minor changes in the VLM output. We leverage this observation to assess the correctness of the generated instance-specific prompts without the need for ground truth. Through progressive negative mining, we iteratively correct erroneous prompts that are difficult to identify. (b) Evaluation of INT. CLIP semantic similarities are compared between the instance-specific prompts generated by INT and the ground truth. INT's contrastive negative mining mechanism effectively corrects erroneous samples, ensuring that the generated instance-specific prompts are optimised instance by instance.
  • Figure 2: INT consists of two main components: instance-specific prompt generation and semantic mask generation. Initially, the former uses VLMs to generate candidate instance-specific prompts. A prompt selection module then selects the prompt with the highest VLM output contrast, refined through progressive negative mining. This selected prompt is passed to the semantic mask generation module, which employs GroundingDINO to ensure that all task-relevant samples in the image are collected as comprehensively as possible. Simultaneously, SAM and CLIP work together to ensure that the generated masks are semantically aligned with the task.
  • Figure 3: Visualization of the results of various segmentation methods across different segmentation tasks.
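
The contrast idea described in the captions of Figures 1 and 2 can be sketched on toy numbers as follows. This is a minimal illustration, not the authors' implementation: the VLM is replaced by precomputed scores, contrast is assumed to be the drop in score after occlusion, and the names `contrast`, `mine_negatives`, `select_prompt` and the `min_contrast` cut-off are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def contrast(
    prompt: str, score_original: Dict[str, float], score_occluded: Dict[str, float]
) -> float:
    """Assumed contrast: how far the VLM score falls once the region is occluded."""
    return score_original[prompt] - score_occluded[prompt]


def mine_negatives(
    candidates: List[str],
    score_original: Dict[str, float],
    score_occluded: Dict[str, float],
    min_contrast: float,
) -> Set[str]:
    """Mark candidates whose score barely changes under occlusion as negatives."""
    return {
        c
        for c in candidates
        if contrast(c, score_original, score_occluded) < min_contrast
    }


def select_prompt(
    candidates: List[str],
    score_original: Dict[str, float],
    score_occluded: Dict[str, float],
    negatives: Set[str],
) -> Tuple[str, float]:
    """Return the non-negative candidate with the highest contrast."""
    kept = [c for c in candidates if c not in negatives]
    if not kept:
        raise ValueError("all candidates were rejected as negatives")
    best = max(kept, key=lambda c: contrast(c, score_original, score_occluded))
    return best, contrast(best, score_original, score_occluded)


# Toy scores standing in for VLM outputs (hypothetical values).
candidates = ["leaf", "frog", "branch"]
score_original = {"leaf": 0.60, "frog": 0.50, "branch": 0.30}
score_occluded = {"leaf": 0.55, "frog": 0.10, "branch": 0.28}

negatives = mine_negatives(candidates, score_original, score_occluded, 0.1)
best_prompt, best_contrast = select_prompt(
    candidates, score_original, score_occluded, negatives
)
print(best_prompt)  # frog
```

In a progressive setting, the negative set would presumably be carried over and extended from one iteration to the next, so that prompts rejected earlier are not proposed again; how the paper accumulates negatives is not stated in this section.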