DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation

Amin Karimi, Charalambos Poullis

TL;DR

DSV-LFS addresses generalization in few-shot semantic segmentation by unifying LLM-driven semantic cues with dense visual matching. It introduces a novel SEM_prompt token in a multimodal LLM to tailor class descriptions to the query image, and a Dense Matching Module that generates a VIS_prompt from 4D hypercorrelations. A prompt-based decoder fuses both prompts with query features to produce masks in a single stage, trained with text and mask losses. Experiments on Pascal-$5^{i}$ and COCO-$20^{i}$ demonstrate state-of-the-art performance, including strong cross-domain transfer. The code is publicly released.

Abstract

Few-shot semantic segmentation (FSS) aims to enable models to segment novel (unseen) object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-$5^{i}$ and COCO-$20^{i}$ show that our framework achieves state-of-the-art performance by a significant margin, demonstrating superior generalization to novel classes and robustness across diverse scenarios. The source code is available at https://github.com/aminpdik/DSV-LFS.
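
As an illustration of the dense pixel-wise matching idea, the sketch below computes a single-layer 4D correlation tensor between query and support feature maps using masked cosine similarity. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `hypercorrelation`, the tensor shapes, and the choice to zero out support background and clamp negative similarities are all assumptions made for illustration. The actual dense matching module may combine several backbone layers and process the result with learned 4D encoder and decoder networks.

```python
import numpy as np


def hypercorrelation(query_feat, support_feat, support_mask, eps=1e-8):
    """Single-layer 4D correlation between query and support features.

    Hypothetical helper for illustration only.
    query_feat:   (C, Hq, Wq) feature map of the query image
    support_feat: (C, Hs, Ws) feature map of the support image
    support_mask: (Hs, Ws) binary mask of the target class in the support
    Returns a tensor of shape (Hq, Wq, Hs, Ws) with values in [0, 1].
    """
    # Assumption: suppress support background so only target pixels match.
    support_feat = support_feat * support_mask[None, :, :]

    c, hq, wq = query_feat.shape
    _, hs, ws = support_feat.shape

    # Flatten spatial dimensions: one column per pixel.
    q = query_feat.reshape(c, hq * wq)
    s = support_feat.reshape(c, hs * ws)

    # L2-normalize each pixel's feature vector (cosine similarity).
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)

    # All-pairs similarity between query pixels and support pixels.
    corr = q.T @ s  # (Hq*Wq, Hs*Ws)

    # Assumption: keep only non-negative similarities.
    corr = np.maximum(corr, 0.0)

    return corr.reshape(hq, wq, hs, ws)
```

Each entry `corr[i, j, k, l]` scores how similar query pixel `(i, j)` is to support pixel `(k, l)`. A visual prompt could then be derived from this tensor by a learned module; that step is outside the scope of this sketch.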

Paper Structure

This paper contains 20 sections, 11 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Technical Overview. The large language model (LLM) first generates a class description $W_{C}$ based on an input prompt, which consists of a simple question regarding the visual features that distinctly define the class $C$ with label $\xi$. The $\{ImageToken\}$ in $W_{C}$ serves as a default token assigned to the query image, and $\{Class\}$ refers to the class label $\xi$. This class description, along with the query image, is then fed into a multi-modal LLM $(\mathcal{F})$ to produce a class-specific semantic prompt $SEM^{f}_{prompt}$. In parallel, a dense matching module, consisting of $\mathcal{F}_{enc}^{4D}$ and $\mathcal{F}_{dec}^{4D}$, generates a class-specific visual prompt $VIS_{prompt}$ from the support and query feature maps obtained from the vision backbone encoder $\mathcal{F}_{enc}$. Finally, these two prompts, together with the query feature maps, are passed to the prompt-based decoder $\mathcal{F}_{dec}$ to produce the final segmentation.
  • Figure 2: Qualitative results. Examples of our method's performance on the COCO-$20^{i}$ dataset. Each column represents an episode, displaying the support image, query image, and predicted segmentation output from top to bottom. The episodes illustrate the model's ability to handle challenges such as the presence of base classes in the query image (e.g., person in motorcycle and train classes) and variations between target objects in support and query images, including scale differences (e.g., handbag), occlusion (e.g., laptop), appearance changes (e.g., potted plant), complex backgrounds (e.g., bird), and deformations (e.g., fire hydrant).
  • ...and 6 more figures