SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts

Robin Schön, Julian Lorenz, Daniel Kienzle, Rainer Lienhart

TL;DR

SkipClick tackles the need for fast, accurate interactive segmentation in winter sports by separating image encoding from prompt processing (late fusion) and enhancing the fusion with multi-level, low-level features and skip connections. The baseline uses a ViT backbone with prompt maps fused through shallow transformers, while SkipClick incorporates intermediate ViT features and a multi-scale feature pyramid feeding a SegFormer decoder, achieving substantial reductions in the number of clicks required (NoC) compared with SAM and HQ-SAM. The approach delivers state-of-the-art results on HQSeg-44k (NoC@90=6.00, NoC@95=9.89) and strong performance on WSESeg and the newly introduced SHSeg dataset, while maintaining real-time response speeds (~6.6 ms per click). Importantly, the architecture generalizes well to non-winter-sports data, as shown by ablations on standard datasets and competitive performance on DAVIS, indicating practical applicability beyond the target domain.
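
The speed benefit of late fusion can be illustrated with a small sketch. It is not the authors' implementation: `LateFusionSession`, `toy_backbone`, and `toy_fusion_head` are hypothetical stand-ins, and the only point shown is that the prompt-independent backbone runs once per image while each click triggers only the lightweight fusion step.

```python
from typing import Any, Callable, List, Optional, Tuple

Click = Tuple[int, int, bool]  # (x, y, is_foreground); hypothetical click format


class LateFusionSession:
    """Toy late-fusion session: encode the image once, fuse prompts per click."""

    def __init__(self, backbone: Callable[[Any], Any],
                 fusion_head: Callable[[Any, List[Click]], Any]) -> None:
        self.backbone = backbone        # expensive, independent of the clicks
        self.fusion_head = fusion_head  # cheap, re-run after every click
        self.features: Optional[Any] = None
        self.clicks: List[Click] = []

    def set_image(self, image: Any) -> None:
        # The backbone is executed exactly once per image and its output cached.
        self.features = self.backbone(image)
        self.clicks = []

    def add_click(self, x: int, y: int, is_foreground: bool) -> Any:
        if self.features is None:
            raise RuntimeError("call set_image() before add_click()")
        self.clicks.append((x, y, is_foreground))
        # Only the fusion head sees the clicks; cached features are reused.
        return self.fusion_head(self.features, self.clicks)


backbone_calls = 0


def toy_backbone(image):
    """Placeholder for the image encoder; counts how often it is invoked."""
    global backbone_calls
    backbone_calls += 1
    return [sum(row) for row in image]


def toy_fusion_head(features, clicks):
    """Placeholder for prompt fusion and decoding; returns a summary dict."""
    return {"num_clicks": len(clicks), "feature_len": len(features)}


session = LateFusionSession(toy_backbone, toy_fusion_head)
session.set_image([[0, 1], [1, 0]])
for click in [(0, 0, True), (1, 1, False), (1, 0, True)]:
    result = session.add_click(*click)

print(backbone_calls, result["num_clicks"])
```

In an early-fusion design the clicks would be fed into the backbone itself, so the expensive encoder would have to be re-run after every click.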

Abstract

In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the object's position with the help of user guidance. In our case, the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards responding quickly after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when segmenting winter sports equipment on the WSESeg dataset. With regard to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and a NoC@95 of 9.89. In addition, we test our model on a novel dataset containing masks for humans during skiing.
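
The NoC@q figures quoted above count how many clicks are needed until the predicted mask reaches an IoU of q percent, averaged over all instances. The following is a minimal sketch of that computation, not code from the paper. It assumes the common convention that an instance which never reaches the threshold is charged the click budget, and the budget of 20 clicks as well as the function names and IoU values are illustrative assumptions.

```python
from typing import Sequence


def clicks_to_reach(ious: Sequence[float], threshold: float,
                    max_clicks: int = 20) -> int:
    """Return the 1-based index of the first click whose IoU meets the threshold.

    `ious[i]` is the IoU of the predicted mask after click i + 1.
    If the threshold is not reached within `max_clicks`, the instance is
    charged `max_clicks` (assumed convention; the budget is an assumption).
    """
    for click_index, iou in enumerate(ious[:max_clicks], start=1):
        if iou >= threshold:
            return click_index
    return max_clicks


def noc(per_instance_ious: Sequence[Sequence[float]], threshold: float,
        max_clicks: int = 20) -> float:
    """Average number of clicks over all instances for a given IoU threshold."""
    counts = [clicks_to_reach(ious, threshold, max_clicks)
              for ious in per_instance_ious]
    return sum(counts) / len(counts)


# Made-up IoU curves for three instances (illustrative values only).
iou_curves = [
    [0.60, 0.80, 0.86, 0.91],  # reaches 0.85 at click 3
    [0.70, 0.88],              # reaches 0.85 at click 2
    [0.50, 0.55, 0.60],        # never reaches 0.85, charged 20 clicks
]
noc_85 = noc(iou_curves, threshold=0.85)
print(f"NoC@85 = {noc_85:.3f}")
```

Lower values are better: a difference of 2.336 clicks in NoC@85 means that, on average, the user needs that many fewer clicks per object to reach 85% IoU.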

Paper Structure (20 sections, 10 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: A comparison of the performance of SkipClick with other methods on the WSESeg dataset [schoen2024wseseg]. The metric is NoC@85.
  • Figure 2: The baseline (left) and the SkipClick (right) architecture. Note that the bulk of the computation happens in the backbone, which only has to be executed once per image. Freezing the backbone during training enables it to retain its generality from unsupervised pretraining. The use of multi-level features and skip connections allows the model to deal with the fine structures encountered when segmenting winter sports equipment.
  • Figure 3: Qualitative examples on WSESeg. Foreground clicks are green, background clicks are red and the masks are blue.
  • Figure 4: Examples for the masks occurring during the interaction. The left column displays the predicted mask along with the clicks. Foreground clicks are green, background clicks are red and the masks are blue. The right column displays the corresponding ground truth.