Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes

Sota Kato; Hinako Mitsuoka; Kazuhiro Hotta

TL;DR

This work addresses the high computational cost and information loss incurred when fine-tuning the Segment Anything Model (SAM), both of which stem from its fixed input size of $1024 \times 1024$. It introduces Generalized SAM (GSAM), which combines training-time random cropping with a Positional Encoding Generator (PEG) and a CNN encoder whose features are merged with the SAM features, plus a Spatial-Multiscale AdaptFormer to capture multi-scale context. On seven diverse datasets, GSAM achieves comparable or superior segmentation accuracy while substantially reducing training MACs, with notable gains on CT data (e.g., $11.50\%$ over AdaptFormer on Synapse), and it can preserve the original aspect ratio of variable-size inputs. Overall, GSAM provides a practical, efficient framework for adapting SAM to arbitrary input sizes across varied domains, broadening its applicability to real-world semantic segmentation tasks.
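
To give a rough sense of why a smaller input reduces training cost, the sketch below counts the patch tokens a ViT-style image encoder would process at the fixed $1024 \times 1024$ size and at a smaller crop. The patch size of 16 and the $256 \times 256$ crop are assumptions made for illustration, not values stated in this summary, and the token count is only a proxy for the MACs reported in the paper.

```python
# Assumption: the image encoder splits the input into 16x16 patches.
PATCH_SIZE = 16


def num_patch_tokens(height, width, patch_size=PATCH_SIZE):
    """Number of non-overlapping patches (tokens) for an input of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


full_tokens = num_patch_tokens(1024, 1024)  # fixed SAM input size
crop_tokens = num_patch_tokens(256, 256)    # hypothetical training crop

print(full_tokens, crop_tokens, full_tokens // crop_tokens)  # 4096 256 16
```

Under these assumptions the crop yields 16 times fewer tokens per image; how that translates into MACs depends on the encoder's attention pattern and on the other components of the model.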

Abstract

Improving the efficiency of fine-tuning foundation models has attracted considerable recent research. In this paper, we propose a novel efficient fine-tuning method that allows the input image size of the Segment Anything Model (SAM) to be variable. SAM is a powerful foundation model for image segmentation trained on huge datasets, but it requires fine-tuning to recognize arbitrary classes. The input image size of SAM is fixed at $1024 \times 1024$, resulting in substantial computational demands during training. Furthermore, the fixed input image size may cause loss of image information, e.g., because images must be deformed to a fixed aspect ratio. To address these problems, we propose Generalized SAM (GSAM). Unlike previous methods, GSAM is the first to apply random cropping during training with SAM, thereby significantly reducing the computational cost of training. Experiments on datasets of various types and pixel counts show that GSAM trains more efficiently than SAM and other fine-tuning methods for SAM while achieving comparable or higher accuracy.
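
The random cropping mentioned above can be sketched as follows. This is a minimal illustration of cropping an image and its label mask with the same random window, not the authors' implementation; the function name `random_crop_pair`, its parameters, and the use of nested lists in place of array types are all assumptions made to keep the example self-contained.

```python
import random


def random_crop_pair(image, mask, crop_h, crop_w, rng):
    """Cut the same randomly placed window out of an image and its mask.

    Both inputs are 2D grids (lists of rows) with identical height and width;
    an image entry may be any per-pixel value, e.g. an (R, G, B) tuple.
    """
    height, width = len(mask), len(mask[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size must not exceed the image size")
    # randint is inclusive on both ends, so the window always fits.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)

    def crop(grid):
        return [row[left:left + crop_w] for row in grid[top:top + crop_h]]

    return crop(image), crop(mask)


rng = random.Random(0)
# Toy 6x8 image whose pixels store their own coordinates, plus a matching mask.
image = [[(r, c) for c in range(8)] for r in range(6)]
mask = [[r * 8 + c for c in range(8)] for r in range(6)]
crop_image, crop_mask = random_crop_pair(image, mask, 4, 4, rng)
```

Using one shared window for the image and the mask keeps pixels and labels aligned, which is the property a segmentation loss relies on.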

Paper Structure (19 sections, 1 equation, 6 figures, 4 tables)


Introduction
Related Works
Segmentation Models
Foundation Models
Efficient Fine-tuning for SAM
Changing The Input Image Size for SAM
Proposed Method
Application of Random Cropping during Training
Spatial-Multiscale AdaptFormer
Experiments
Datasets and Metrics
Training Conditions
Experimental Results
Quantitative Results
Qualitative Results
...and 4 more sections

Figures (6)

Figure 1: When SAM is fine-tuned for semantic segmentation with conventional methods, only fixed-size images can be input. As a result, input images are deformed to fit a specific size, causing information loss. In contrast, GSAM supports various input image sizes while maintaining the superior segmentation performance of SAM. This allows images to be used in their original form and enables random cropping during fine-tuning, previously unavailable in SAM-related methods. GSAM provides efficient fine-tuning, specialized for semantic segmentation of arbitrary data, minimizing information loss and computational costs.
Figure 2: The trade-off between MACs and segmentation accuracy (mIoU) for conventional SAM fine-tuning methods and GSAM on the ISBI2012 dataset. Red circles indicate the proposed GSAM and triangles indicate the conventional methods. Random cropping is performed only for GSAM, with crops of the pixel count indicated by "size"; the structure of the other methods does not permit random cropping.
Figure 3: Overview of Generalized SAM. "Frozen" indicates a network whose weight parameters are fixed, and "Learnable" indicates a network whose weight parameters are updated.
Figure 4: Overview of Spatial-Multiscale AdaptFormer. Five convolutional layers with different receptive fields are used to acquire the spatial features necessary for semantic segmentation.
Figure 5: Qualitative results. The first row shows results on ISBI2012, the second on M-Building, the third on Cityscapes, and the fourth on Trans10k. Cityscapes images can be fed into GSAM at their original aspect ratio, but for ease of comparison they are shown at the same aspect ratio as for the other methods. (a) Input image, (b) Ground truth, (c) SAM, (d) AdaptFormer, (e) SAMUS, (f) GSAM.
...and 1 more figures
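
The deformation described in the Figure 1 caption can be quantified with a small calculation. The sketch below is only illustrative: the helper names are hypothetical, and the $2048 \times 1024$ source size is an assumed example of a wide image rather than a figure taken from this summary.

```python
def resize_scales(height, width, target_h, target_w):
    """Per-axis scale factors (vertical, horizontal) of a plain resize."""
    return target_h / height, target_w / width


def aspect_distortion(height, width, target_h, target_w):
    """Ratio between the larger and smaller axis scale; 1.0 means no deformation."""
    scale_y, scale_x = resize_scales(height, width, target_h, target_w)
    return max(scale_x, scale_y) / min(scale_x, scale_y)


# Assumed example: a 2048-wide, 1024-tall image forced into a 1024 x 1024 input.
print(aspect_distortion(1024, 2048, 1024, 1024))  # 2.0
# An input that is already square is not deformed.
print(aspect_distortion(512, 512, 1024, 1024))    # 1.0
```

A value of 2.0 means objects are squeezed to half their width relative to their height, which is the kind of information loss that feeding images at their original aspect ratio avoids.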
