VP Lab: a PEFT-Enabled Visual Prompting Laboratory for Semantic Segmentation

Niccolo Avogaro, Thomas Frick, Yagmur G. Cinar, Daniel Caraballo, Cezary Skura, Filip M. Janicki, Piotr Kluska, Brown Ebouky, Nicola Farronato, Florian Scheidegger, Cristiano Malossi, Konrad Schindler, Andrea Bartezzaghi, Roy Assaf, Mattia Rigotti

TL;DR

VP Lab tackles the challenge of adapting large pretrained vision models to domain-specific semantic segmentation without extensive retraining. The method combines an ensemble of parameter-efficient fine-tuning techniques (E-PEFT) with a visual prompting pipeline (SoftMatcher) and a label-refinement loop to enable fast, interactive test-time adaptation of SAM. Key findings show that E-PEFT surpasses state-of-the-art HQ-SAM on Kvasir-Seg and HQ-44k with orders of magnitude fewer trainable parameters and achieves about $50\%$ average mIoU gains in 5-shot settings across multiple technical datasets. The work demonstrates a practical, user-in-the-loop deployment pathway (VP Lab) that delivers production-ready domain-adapted segmentation within minutes, facilitating rapid deployment in medical-imaging and other technical inspection domains.

Abstract

Large-scale pretrained vision backbones have transformed computer vision by providing powerful feature extractors that enable various downstream tasks, including training-free approaches like visual prompting for semantic segmentation. Despite their success in generic scenarios, these models often fall short when applied to specialized technical domains where the visual features differ significantly from their training distribution. To bridge this gap, we introduce VP Lab, a comprehensive iterative framework that enhances visual prompting for robust segmentation model development. At the core of VP Lab lies E-PEFT, a novel ensemble of parameter-efficient fine-tuning techniques specifically designed to adapt our visual prompting pipeline to specific domains in a manner that is both parameter- and data-efficient. Our approach not only surpasses the state-of-the-art in parameter-efficient fine-tuning for the Segment Anything Model (SAM), but also facilitates an interactive, near-real-time loop, allowing users to observe progressively improving results as they experiment within the framework. By integrating E-PEFT with visual prompting, we demonstrate a remarkable 50% increase in semantic segmentation mIoU performance across various technical datasets using only 5 validated images, establishing a new paradigm for fast, efficient, and interactive model deployment in new, challenging domains. This work comes in the form of a demonstration.
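
For readers unfamiliar with the metric behind the headline number, the following is a minimal sketch of how mean intersection-over-union (mIoU) is commonly computed for semantic segmentation. It is an illustration only, not the authors' evaluation code: the function name `mean_iou`, the flattened-list input format, and the choice to skip classes absent from both prediction and ground truth are assumptions made here. The abstract does not state whether the reported 50% gain is relative or absolute, so the sketch does not try to reproduce it.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    `pred` and `target` are flattened label maps (one class id per pixel).
    Classes absent from both are skipped (an assumption of this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in at least one map
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: a flattened 2x3 image with two classes (hypothetical data)
target = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy example each class has an intersection of 2 pixels and a union of 4 pixels, so both per-class IoUs and their mean are 0.5.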

Paper Structure

This paper contains 3 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: VP Lab Workflow: Users prompt and validate a reference image, which guides predictions on target datasets. They can refine these predictions with a labeling tool, and feed them into a parameter-efficient fine-tuning process of the underlying model. After iterative improvements, the optimized model can be exported and deployed as needed.