BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Yiran Song, Qianyu Zhou, Xiangtai Li, Deng-Ping Fan, Xuequan Lu, Lizhuang Ma

TL;DR

A Scalable Bias-Mode Attention Mask (BA-SAM) is proposed to enhance SAM's adaptability to varying image resolutions while eliminating the need for structure modifications, and a new scaling factor is introduced to ensure consistent magnitude in the attention layer's dot product values when the token sequence length changes.

Abstract

In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits performance degradation when faced with datasets of varying image sizes. Previous approaches tend to resize the image to a fixed size or adopt structure modifications, hindering the preservation of SAM's rich prior knowledge. Besides, such task-specific tuning necessitates a complete retraining of the model, which is costly and unacceptable for deployment in downstream tasks. We reformulate this issue as a length extrapolation problem, where the token sequence length varies while the patch size stays consistent across images of different sizes. To this end, we propose the Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structure modifications. Firstly, we introduce a new scaling factor to ensure consistent magnitude in the attention layer's dot product values when the token sequence length changes. Secondly, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. Our BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to significantly mitigate performance degradation in the zero-shot setting and achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously. Code is available at https://github.com/zongzi13545329/BA-SAM
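
To make the length extrapolation framing concrete, the sketch below counts patch tokens for different image sizes at a fixed patch size and rescales the attention logits as the token count changes. The abstract does not give the paper's scaling formula, so the `log(n) / log(n_train)` factor is an assumed stand-in (a common length-extrapolation heuristic), and the function names, the patch size of 16, and the 1024x1024 training resolution are illustrative assumptions.

```python
import math

def num_tokens(height, width, patch_size=16):
    """Number of patch tokens when the patch size is held fixed."""
    return (height // patch_size) * (width // patch_size)

def length_aware_scale(n_tokens, n_train, head_dim):
    """Assumed scale (not the paper's formula): the usual 1/sqrt(d)
    multiplied by log(n_tokens) / log(n_train), so the scale equals
    the standard one at the training length."""
    return math.log(n_tokens) / math.log(n_train) / math.sqrt(head_dim)

n_train = num_tokens(1024, 1024)  # assumed training resolution
n_test = num_tokens(512, 512)     # smaller test image, fewer tokens

scale_train = length_aware_scale(n_train, n_train, head_dim=64)
scale_test = length_aware_scale(n_test, n_train, head_dim=64)
print(n_train, n_test, scale_train, scale_test)
```

At the training length the factor reduces to the standard `1 / sqrt(head_dim)`; for shorter or longer sequences it shrinks or grows slowly with the logarithm of the token count.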

Paper Structure (13 sections, 10 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Top: contrast between prior methods [zhang2023customized, beyer2023flexivit] and BA-SAM. For large-scale datasets, previous approaches often resize images or change patch sizes to handle the issue of varying resolutions. In contrast, we propose a Scalable Bias-Mode Attention Mask (BA-SAM), which enhances SAM’s adaptability to varying image resolutions while eliminating structure modifications. Bottom (left): We introduce a generalized model that outperforms state-of-the-art methods across four datasets. Bottom (right): With resolution variations, prior models' performance degrades drastically. Instead, BA-SAM consistently alleviates this issue (the evaluation metric is MAE).
  • Figure 2: Illustration of the proposed BA-SAM method. (a) In the original SAM, when the input token sequence length varies during testing, the magnitude of the Softmax outputs changes drastically. We propose a new scaling factor to address this issue. (b) We introduce a bias-mode attention mask, which produces increasing penalties on attention scores as the distance between the query and key grows.
  • Figure 3: Embedding of our BA-SAM into a SAM backbone. NSF indicates our new scaling factor, and BM-AM denotes our designed bias-mode attention mask.
  • Figure 4: Visualization results of our BA-SAM on four object segmentation tasks, i.e., skin lesion segmentation, salient object segmentation, complex object segmentation, and camouflaged object detection, which correspond to four datasets: ISIC [codella2018skin], DUTS [DUTS], DIS-TE4 [DIS], and COD10K [COD]. Our BA-SAM handles varying image resolutions and segments accurately across the different tasks.
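
The bias-mode attention mask described in the Figure 2 caption can be sketched as an additive penalty on attention scores that grows with query-key distance. The captions do not specify the distance measure or penalty shape, so this example assumes a linear penalty on Manhattan distance over the 2D patch grid; `grid_distance_bias`, `slope`, and the 3x3 grid are hypothetical choices for illustration only.

```python
import math

def grid_distance_bias(grid_h, grid_w, slope=0.5):
    """Additive bias of shape (N, N) with N = grid_h * grid_w.
    Assumed form: -slope * Manhattan distance between patch positions."""
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    return [
        [-slope * (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, bias):
    """Add the distance bias to raw scores, then normalise each row."""
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]

bias = grid_distance_bias(3, 3)
uniform_scores = [[0.0] * 9 for _ in range(9)]  # all keys equally relevant
weights = biased_attention_weights(uniform_scores, bias)
print([round(w, 3) for w in weights[4]])  # attention row of the centre patch
```

With uniform raw scores, the bias alone makes each query place the most weight on itself and its nearest neighbours, which is the qualitative behaviour the caption describes.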