SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts
Robin Schön, Julian Lorenz, Daniel Kienzle, Rainer Lienhart
TL;DR
SkipClick tackles the need for fast, accurate interactive segmentation in winter sports by separating image encoding from prompt processing (late fusion) and enhancing the fusion with multi-level, low-level features and skip connections. The baseline uses a ViT backbone with prompt maps fused through shallow transformers, while SkipClick incorporates intermediate ViT features and a multi-scale feature pyramid feeding a SegFormer decoder, achieving substantial reductions in the number of clicks required (NoC) compared with SAM and HQ-SAM. The approach delivers state-of-the-art results on HQSeg-44k (NoC@90=6.00, NoC@95=9.89) and strong performance on WSESeg and the newly introduced SHSeg dataset, while maintaining real-time response speeds (~6.6 ms per click). Importantly, the architecture generalizes well to non-winter-sports data, as shown by ablations on standard datasets and competitive performance on DAVIS, indicating practical applicability beyond the target domain.
Abstract
In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. Interactive segmentation is concerned with predicting high-quality segmentation masks by informing the network about the object's position through user guidance. In our case, the guidance consists of click prompts. For this task, we first present a baseline architecture that is specifically geared towards responding quickly after each click. Afterwards, we motivate and describe a number of architectural modifications that improve performance when segmenting winter sports equipment on the WSESeg dataset. With regard to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and a NoC@95 of 9.89. In addition, we test our model on a novel dataset containing masks for humans during skiing.
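
The NoC@85, NoC@90 and NoC@95 figures quoted above count how many clicks are needed before the predicted mask reaches an IoU of 85%, 90% or 95% with the ground truth. The following is a minimal sketch of how such a metric is commonly computed, not the authors' evaluation code. The function names, the input format (one list of per-click IoU values per object) and the cap of 20 clicks for objects that never reach the threshold are assumptions made for illustration.

```python
from typing import Sequence


def noc_at(ious_per_click: Sequence[float], threshold: float,
           max_clicks: int = 20) -> int:
    """Number of clicks needed to reach `threshold` IoU for one object.

    `ious_per_click[i]` is the IoU after click i + 1. If the threshold is
    never reached within `max_clicks`, the object is charged `max_clicks`
    (an assumed convention, not taken from the paper text above).
    """
    for click_idx, iou in enumerate(ious_per_click[:max_clicks], start=1):
        if iou >= threshold:
            return click_idx
    return max_clicks


def mean_noc(all_ious: Sequence[Sequence[float]], threshold: float,
             max_clicks: int = 20) -> float:
    """Average the per-object click counts over a dataset."""
    counts = [noc_at(trace, threshold, max_clicks) for trace in all_ious]
    return sum(counts) / len(counts)


# Hypothetical IoU traces for two objects (made-up numbers).
traces = [
    [0.70, 0.86, 0.91],  # reaches 0.85 at the second click
    [0.50, 0.60, 0.80],  # never reaches 0.85, so it is charged max_clicks
]
print(mean_noc(traces, threshold=0.85))
```

Under this convention a lower value is better, and a difference such as the reported 2.336 clicks means that, on average, the user needs that many fewer clicks per object to reach the target quality.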
