Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes
Sota Kato, Hinako Mitsuoka, Kazuhiro Hotta
TL;DR
This work tackles the high computational cost and information loss that arise when fine-tuning the Segment Anything Model (SAM), both caused by its fixed input size of $1024 \times 1024$. It introduces Generalized SAM (GSAM), which combines training-time random cropping with a Positional Encoding Generator (PEG), a CNN Encoder whose features are merged with SAM's, and a Spatial-Multiscale AdaptFormer that captures multi-scale context. Empirical results on seven diverse datasets show that GSAM achieves comparable or superior segmentation accuracy while substantially reducing training MACs, with notable gains on CT data (e.g., $11.50\%$ over AdaptFormer on Synapse) and the ability to preserve the original aspect ratio of variable-size inputs. Overall, GSAM provides a practical, efficient framework for adapting SAM to arbitrary input sizes across varied domains, enhancing its applicability to real-world semantic segmentation tasks.
Abstract
Much recent research has focused on improving the efficiency of fine-tuning foundation models. In this paper, we propose a novel efficient fine-tuning method that allows the input image size of the Segment Anything Model (SAM) to be variable. SAM is a powerful foundation model for image segmentation trained on huge datasets, but it requires fine-tuning to recognize arbitrary classes. The input image size of SAM is fixed at 1024 × 1024, resulting in substantial computational demands during training. Furthermore, the fixed input image size may result in the loss of image information, e.g., because the aspect ratio is fixed. To address these problems, we propose Generalized SAM (GSAM). Unlike previous methods, GSAM is the first to apply random cropping during training with SAM, thereby significantly reducing the computational cost of training. Experiments on datasets of various types and pixel counts show that GSAM trains more efficiently than SAM and other fine-tuning methods for SAM while achieving comparable or higher accuracy.
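To make the effect of random cropping concrete, the following is a minimal sketch, not the authors' implementation. It crops an image and its segmentation mask at the same random location and counts the patch tokens a ViT encoder would process. The helper names `random_crop_pair` and `num_patch_tokens` and the 256 × 256 crop size are hypothetical choices for illustration. The patch size of 16 matches SAM's image encoder.

```python
import numpy as np

PATCH = 16  # patch size of SAM's ViT image encoder


def num_patch_tokens(height, width, patch=PATCH):
    """Number of non-overlapping patch tokens for an input of the given size."""
    return (height // patch) * (width // patch)


def random_crop_pair(image, mask, crop_h, crop_w, rng):
    """Crop image (H, W, C) and mask (H, W) at the same random location.

    Hypothetical helper: the crop size is an assumption, not a value
    taken from the paper.
    """
    h, w = mask.shape[:2]
    if crop_h > h or crop_w > w:
        raise ValueError("crop size must not exceed the image size")
    # Upper bound of rng.integers is exclusive, so add 1 to allow the last offset.
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    image_crop = image[top:top + crop_h, left:left + crop_w]
    mask_crop = mask[top:top + crop_h, left:left + crop_w]
    return image_crop, mask_crop


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.zeros((600, 800, 3), dtype=np.uint8)  # non-square input
    mask = np.zeros((600, 800), dtype=np.int64)
    image_crop, mask_crop = random_crop_pair(image, mask, 256, 256, rng)
    print(image_crop.shape, mask_crop.shape)  # (256, 256, 3) (256, 256)
    print(num_patch_tokens(1024, 1024))       # 4096
    print(num_patch_tokens(256, 256))         # 256
```

Under these assumptions, a 256 × 256 crop yields 16 times fewer patch tokens than the fixed 1024 × 1024 input, which is the source of the training-cost reduction the abstract describes. The actual MAC savings depend on the encoder's attention pattern and on the added modules.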
