
SPT: Sequence Prompt Transformer for Interactive Image Segmentation

Senlin Cheng, Haopeng Sun

TL;DR

The paper tackles interactive segmentation across sequences of images by introducing the Sequence Prompt Transformer (SPT), which leverages prior frames, user clicks, and predicted masks as prompts. A Top-k Prompt Selection (TPS) module selects the most informative prompts via DINOv2 features, and a ViT-based backbone feeds into a Sequence Prompt Transformer with concealed self-attention to prevent leakage of future information. The model is trained with a lightweight MLP segmentation head after a Feature Pyramid Module and is optimized using Focal Loss, achieving state-of-the-art results on multiple benchmarks including a new ADE20K-Seq sequential dataset. This approach enables more accurate and efficient segmentation in tasks where objects persist across image sequences, offering practical benefits for video-like editing and large-scale annotations.
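
The summary above does not specify how the attention mask is laid out. As a minimal sketch, assuming a standard lower-triangular (causal) mask over sequence positions, the idea of hiding future positions can be illustrated as follows. The function names `causal_mask` and `masked_attention_weights` are hypothetical and are not taken from the paper.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_attention_weights(scores, mask):
    """Row-wise softmax over raw attention scores, skipping masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        visible = [s for s, ok in zip(row, allowed) if ok]
        peak = max(visible)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy 3x3 score matrix: rows are query positions, columns are key positions.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
weights = masked_attention_weights(scores, causal_mask(3))
```

With this mask, the first position attends only to itself, and every later position distributes its weight over itself and earlier positions, so no weight reaches a future position.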

Abstract

Interactive segmentation aims to extract objects of interest from an image based on user-provided clicks. In real-world applications, there is often a need to segment a series of images featuring the same target object. However, existing methods typically process one image at a time and fail to consider the sequential nature of the images. To overcome this limitation, we propose a novel method called Sequence Prompt Transformer (SPT), the first to utilize sequential image information for interactive segmentation. Our model comprises two key components: (1) the Sequence Prompt Transformer (SPT), which acquires information from a sequence of images, clicks, and masks to improve accuracy; and (2) Top-k Prompt Selection (TPS), which selects precise prompts for SPT to further improve segmentation quality. Additionally, we create the ADE20K-Seq benchmark to better evaluate model performance. We evaluate our approach on multiple benchmark datasets and show that our model surpasses state-of-the-art methods across all datasets.
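
The abstract does not state the exact selection criterion used by TPS; the TL;DR only says that it relies on DINOv2 features. A minimal sketch, assuming each image is represented by one nonzero feature vector and that candidates are ranked by cosine similarity to the current image, might look like this. The helper names and the toy three-dimensional vectors are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two nonzero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_top_k(query, candidates, k):
    """Return indices of the k candidates most similar to the query, best first."""
    scored = [(cosine_similarity(query, c), i) for i, c in enumerate(candidates)]
    # Sort by descending similarity; break ties by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]

# Hypothetical features: one for the current image, four for earlier images.
query_feature = [1.0, 0.0, 0.0]
previous_features = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [-1.0, 0.0, 0.0],
]
selected = select_top_k(query_feature, previous_features, k=2)
```

Here `selected` holds the indices of the two earlier images whose features point most nearly in the same direction as the current image; only those would be passed on as prompts under this assumption.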

Paper Structure

This paper contains 17 sections, 9 equations, 4 figures.

Figures (4)

  • Figure 1: Segmentation of car windows: (a) Existing methods process individual images, causing prediction faults; (b) SPT learns useful information from previous images, clicks, and masks to achieve precise results.
  • Figure 2: Overview of Sequence Prompt Transformer (SPT).
  • Figure 3: Qualitative analysis on the ADE20K-Seq dataset. (a) Image. (b) Ground-truth mask. (c) Results of SimpleClick. (d) Results of FocalClick. (e) Results of RITM. (f) Results of SPT (ours).
  • Figure 4: Comparison of mIoU with different numbers of clicks against baselines on the ADE20K-Seq dataset. Higher mIoU is better.
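
The evaluation protocol behind Figure 4 is not described in this section. As a rough sketch of how a mean-IoU-per-click curve is commonly computed, assuming binary masks stored as nested lists and one predicted mask per image per click, one could write the following. The function names and the toy 2x2 masks are hypothetical.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0

def mean_iou_per_click(predictions, targets):
    """predictions[n][c] is the mask for image n after c + 1 clicks."""
    num_clicks = len(predictions[0])
    return [
        sum(iou(preds[c], t) for preds, t in zip(predictions, targets)) / len(targets)
        for c in range(num_clicks)
    ]

# Two toy images, each with predictions after one and two clicks.
targets = [
    [[1, 1], [0, 0]],
    [[1, 0], [1, 0]],
]
predictions = [
    [[[1, 0], [0, 0]], [[1, 1], [0, 0]]],
    [[[1, 1], [1, 0]], [[1, 0], [1, 0]]],
]
curve = mean_iou_per_click(predictions, targets)
```

Each entry of `curve` is the IoU averaged over all images at a fixed click count, which corresponds to one point on a plot of mIoU against number of clicks.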