HRSAM: Efficient Interactive Segmentation in High-Resolution Images

You Huang, Wenbin Lai, Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

TL;DR

This work focuses on visual length extrapolation and proposes HRSAM, a lightweight model based on Swin attention. Trained at low resolutions, HRSAM generalizes to high resolutions and, with SAM-distillation, outperforms the teacher model at lower latency.

Abstract

The Segment Anything Model (SAM) has advanced interactive segmentation but is limited by its high computational cost on high-resolution images. This cost forces downsampling to meet GPU constraints, sacrificing the fine-grained details needed for high-precision interactive segmentation. To address SAM's limitations, we focus on visual length extrapolation and propose a lightweight model named HRSAM. The extrapolation enables HRSAM trained on low resolutions to generalize to high resolutions. We begin by identifying a link between the extrapolation and attention scores, which leads us to base HRSAM on Swin attention. We then introduce the Flexible Local Attention (FLA) framework, which uses CUDA-optimized Efficient Memory Attention (EMA) to accelerate HRSAM. Within FLA, we implement Flash Swin attention, achieving over a 35% speedup compared to traditional Swin attention, and propose a KV-only padding mechanism to enhance extrapolation. We also develop the Cycle-scan module, which uses State Space models to efficiently expand HRSAM's receptive field. We further develop HRSAM++ within FLA by adding an anchor map, which provides multi-scale data augmentation for the extrapolation and a larger receptive field at a slight additional computational cost. Experiments show that, under standard training, HRSAMs surpass the previous SOTA with only 38% of the latency. With SAM-distillation, the extrapolation enables HRSAMs to outperform the teacher model at lower latency. Further finetuning achieves performance significantly exceeding the previous SOTA.
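The link between extrapolation and attention scores can be illustrated with a toy experiment. The sketch below is not the paper's analysis: it assumes random Gaussian logits, a patch size of 16, and a 16×16 local window (all illustrative choices), and it only shows why the top-1 post-softmax score tends to shrink as a global-attention token attends over more keys, while a fixed-size window keeps the number of keys, and hence the score statistics, unchanged across resolutions.

```python
import math
import random


def mean_top1_score(num_keys, trials=20, seed=0):
    """Mean of the largest post-softmax weight over `trials` random queries.

    Logits are drawn i.i.d. from N(0, 1); this is a toy assumption,
    not a model of real attention logits.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        logits = [rng.gauss(0.0, 1.0) for _ in range(num_keys)]
        peak = max(logits)
        # Numerically stable softmax: the top-1 numerator is exp(0) = 1.
        denom = sum(math.exp(x - peak) for x in logits)
        total += 1.0 / denom
    return total / trials


PATCH = 16   # assumed patch size (hypothetical for this sketch)
WINDOW = 16  # assumed local window side, in tokens (hypothetical)

# Global attention: every token attends to all (side / PATCH)^2 tokens.
global_scores = {}
for side in (1024, 2048):
    num_tokens = (side // PATCH) ** 2
    global_scores[side] = mean_top1_score(num_tokens)

# Window attention: the key count is WINDOW^2 regardless of resolution.
window_score = mean_top1_score(WINDOW * WINDOW)

for side, score in global_scores.items():
    print(f"global attention @ {side}^2: mean top-1 score = {score:.5f}")
print(f"window attention (any resolution): mean top-1 score = {window_score:.5f}")
```

Under these assumptions the global-attention score drops when the input grows from $1024^2$ to $2048^2$, whereas the window-attention score is the same at both resolutions by construction.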

Paper Structure (28 sections, 20 equations, 13 figures, 4 tables)

This paper contains 28 sections, 20 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Analysis of visual length extrapolation over Global-attn and Swin-attn ViTs. (a) illustrates the mean IoU at each interaction step under the standard testing protocol of interactive segmentation on HQSeg-44K for both ViTs, with input resolutions of $1024^2$ and $2048^2$. (b) shows the inference latency per image for both ViTs at different input resolutions. (c) presents a box plot of the top-1 attention scores (post-softmax) for each image token across all attention computations in both ViTs, with the top $10\%$ values removed for visual clarity and red labels indicating the 0.25 and 0.75 quantiles. Global-attn ViT shows a more pronounced reduction in these scores at higher resolutions.
  • Figure 2: Overview of our proposed HRSAM. HRSAM contains four stages, each with three Flash Swin modules and one Cycle-scan module. Alternating Flash Swin modules apply shifts of 0 and 8 tokens. Each Flash Swin is implemented within the FLA framework, incorporating token reordering (via index mapping) and local attention computation accelerated by EMA. Outputs from all stages are fused through summation and refined via a convolutional block to create final image embeddings fed into the SAM decoder.
  • Figure 3: Illustration of EMA's block-diagonal attention [xformers_ops]. Subsequences of different lengths are shown in blue, yellow and green, with each sequence's attention computation indicated in gray. The black regions represent skipped areas in EMA's computation, optimizing parallelism and reducing memory usage.
  • Figure 4: Illustration of Flash Swin's index mapping for a simple case. The top part shows the shift operation from a 2D perspective, while the bottom part depicts the actual index mapping.
  • Figure 5: Latency (ms) of various Swin attention implementations. The evaluation is conducted with $h = 64$, $w = 64$ and $C = 768$.
  • ...and 8 more figures
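
The ideas in the captions of Figures 3 and 4 (reordering tokens by an index mapping, then computing attention only inside each subsequence) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the shift is modeled as an offset of the window grid with truncated border windows (the paper's exact mapping may differ), and the attention loop is a slow reference version of what a fused block-diagonal kernel would compute.

```python
import math


def window_index_mapping(h, w, window, shift):
    """Group row-major token indices of an h x w grid into local windows.

    Assumption: window boundaries are offset by `shift` tokens
    (0 <= shift < window) and border windows are truncated, which
    yields subsequences of different lengths.
    """
    def spans(size):
        cuts = [0]
        pos = shift if shift > 0 else window
        while pos < size:
            cuts.append(pos)
            pos += window
        cuts.append(size)
        return list(zip(cuts[:-1], cuts[1:]))

    return [
        [r * w + c for r in range(r0, r1) for c in range(c0, c1)]
        for r0, r1 in spans(h)
        for c0, c1 in spans(w)
    ]


def block_diagonal_attention(q, k, v, seqlens):
    """Reference attention restricted to consecutive subsequences.

    q, k, v are lists of equal-length float vectors, already reordered so
    that each window is contiguous. Cross-window scores are never computed.
    """
    out, start = [], 0
    scale = 1.0 / math.sqrt(len(q[0]))
    for n in seqlens:
        block = range(start, start + n)
        for i in block:
            scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in block]
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            z = sum(weights)
            out.append([
                sum(wt * v[j][d] for wt, j in zip(weights, block)) / z
                for d in range(len(v[0]))
            ])
        start += n
    return out


# Toy example: 4 x 4 tokens, window of 2 tokens, shift of 1 token.
windows = window_index_mapping(4, 4, window=2, shift=1)
order = [idx for win in windows for idx in win]   # index mapping
seqlens = [len(win) for win in windows]           # block sizes

tokens = [[float(i), 1.0] for i in range(16)]     # dummy 2-d features
reordered = [tokens[i] for i in order]
attended = block_diagonal_attention(reordered, reordered, reordered, seqlens)

# Inverse mapping: put outputs back into the original row-major order.
restored = [None] * len(order)
for pos, idx in enumerate(order):
    restored[idx] = attended[pos]
```

With a shift of 0 the same mapping produces four equal windows of four tokens each; with a shift of 1 it produces nine windows of varying size, which is the situation a block-diagonal attention kernel handles without padding the shorter subsequences.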