Adapting SAM with Dynamic Similarity Graphs for Few-Shot Parameter-Efficient Small Dense Object Detection: A Case Study of Chickpea Pods in Field Conditions

Xintong Jiang, Yixue Liu, Mohamed Debbagh, Yu Tian, Valerio Hoyos-Villegas, Viacheslav Adamchuk, Shangpeng Sun

TL;DR

This work tackles few-shot, pixel-level segmentation of small, dense agricultural organs under complex field conditions. It introduces Dynamic Similarity-based Graph Adaptation (DSGA) combined with Low-Rank Adaptation (LoRA) to adapt the Segment Anything Model (SAM) for both foreground and instance segmentation with minimal data. DSGA builds a dynamic adjacency graph with learnable rank-weighted neighbors and adaptive local pooling to capture global and local dependencies, while LoRA tunes only the query and value projections. The result is a parameter-efficient framework (roughly 4.6% of SAM) that is applied in a two-stage process with adaptive prompt generation and a composite loss. Empirical results on a chickpea pod dataset show superior performance on both foreground and instance segmentation across 2–10 shots, along with interpretable visualizations (Grad-CAM, t-SNE) and strong field-counting accuracy (adjusted R^2 ≈ 0.899), highlighting practical applicability for automated agricultural monitoring and phenotyping. Limitations include resolution constraints and occlusion; future directions include multispectral data and cross-crop generalization.

Abstract

Parameter-Efficient Fine-Tuning (PEFT) of foundation models for agricultural computer vision tasks remains challenging due to limited training data and complex field conditions. This study introduces a Dynamic Similarity-based Graph Adaptation (DSGA) module to adapt the Segment Anything Model (SAM) under extreme data constraints for precise foreground and instance segmentation of small dense objects in complex agricultural environments. Through dynamic similarity graph construction with a learnable polynomial decay-initialized weight ranking mechanism and adaptive local feature aggregation, DSGA establishes robust spatial and dynamic similarity representation with only 4.00M trainable parameters, which is 4.26% of the original SAM. Integrating this graph-based feature adaptation with Low-Rank Adaptation (LoRA) creates a complementary optimization framework that effectively captures both local and global dependencies in image embeddings while preserving model stability and parameter efficiency. Experimental results on a challenging chickpea pod dataset demonstrated that DSGA with LoRA achieved superior performance across multiple metrics evaluated under 2, 4, 8 and 10 shots, with progressive performance gains as shot count increased. Quantitative metrics showed a 17.31% improvement in Structure-measure and a 62.36% gain in adaptive F-measure compared to the baseline SAM fine-tuning. Comprehensive ablation studies and visualization analyses through Grad-CAM and t-SNE validated the framework's effectiveness in feature discrimination. The proposed adaptation demonstrated practical utility for automated agricultural monitoring applications, achieving accurate pod-counting with an adjusted R-squared of 0.8987 for images with 10 to 120 pods under challenging field conditions.
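For readers unfamiliar with the counting metric quoted above, the adjusted R-squared is conventionally defined as shown below. This is the standard textbook definition, not a formula taken from the paper; $n$ (number of evaluated images) and $p$ (number of predictors in the regression of predicted against ground-truth pod counts) are generic symbols assumed here for illustration.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Because the correction factor $(n-1)/(n-p-1)$ is at least 1, the adjusted value never exceeds the plain $R^2$, and the gap shrinks as the number of images grows.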

Paper Structure

This paper contains 32 sections, 23 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Representative existing adaptation methods. (a) Bottleneck Adapter with dimension reduction and expansion through MLP layers, (b) LoRA adapting pre-trained weights of the Query (Q) and Value (V) projection matrices through low-rank matrices, and (c) Visual Prompt Tuning that adds learnable embeddings to the input sequence.
  • Figure 2: Architecture overview of the proposed PEFT framework. The two-stage process begins with promptless foreground extraction using a SAM image encoder adapted with DSGA and LoRA. The system then automatically generates point prompts from foreground regions, which guide instance-level segmentation in the second stage. The final output retains only the instance masks with the highest predicted IoU scores, eliminating duplicates when multiple prompts target the same object.
  • Figure 3: SAM image encoder adaptation with LoRA and DSGA. The multi-head attention section remains mostly frozen, with LoRA adaptation applied only to Query and Value projection matrices. The high-level overview of DSGA illustrates the dynamic graph construction during adaptation. While LoRA operates inside attention, DSGA modules are positioned at the terminal layers of each ViT block to process the refined image embeddings.
  • Figure 4: Architecture of the DSGA module. The bottleneck structure reduces dimensionality so that DSGA adapts features in a lower-dimensional space before restoring the original dimensions. The DSGA enhances feature representations through: (a) Dynamic Similarity Adjacency Graph Construction, which establishes global dependencies via $\text{L}^2$-normalized similarity computation with a learnable top-k selection parameter $\theta_k$ and learnable rank weights $w_r$, followed by $\text{L}^1$ row-wise normalization; and (b) Adaptive Hybrid Pooling, which captures local context through weighted fusion of maximum and average operations controlled by learnable parameters $w_p$ and $w_n$. Residual connections preserve feature stability during adaptation.
  • Figure 5: Prompt generation module based on grid-based sampling and instance optimization. (a) Partitioning of the predicted foreground segmentation mask into uniform grid cells, with dimensions constrained by the minimum instance size to ensure comprehensive coverage. (b) Spatial distribution of automatically generated point prompts on the feature activation map, with each prompt placed at the center of a region of high semantic relevance. (c) All 682 candidate instance-level segmentation masks inferred from each generated prompt point. (d) Selection of the optimal instance mask based on spatial overlap. (e) Final optimized instance masks selected based on predicted IoUs, reducing the total instances from 678 candidates to 127 distinct masks.
  • ...and 6 more figures
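
The graph construction summarized in the Figure 4 caption ($\text{L}^2$-normalized similarity, top-k selection, rank weights, $\text{L}^1$ row normalization, residual connection) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and rests on several assumptions: `k` is a fixed integer rather than the learnable $\theta_k$, the rank weights are fixed at an assumed polynomial-decay initialization `(r + 1) ** -power`, each adjacency entry is taken to be the rank weight times the cosine similarity, and each token counts as its own nearest neighbor. All function names are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a feature vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def polynomial_decay_weights(k, power=1.0):
    """Assumed initialization of rank weights: rank r gets (r + 1) ** -power."""
    return [(r + 1) ** -power for r in range(k)]


def similarity_adjacency(features, k, rank_weights):
    """Build a row-normalized top-k similarity graph over token features."""
    unit = [l2_normalize(f) for f in features]
    n = len(unit)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Cosine similarity of token i to every token (including itself).
        sims = [sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
        # Keep the k most similar tokens, ordered from most to least similar.
        top = sorted(range(n), key=lambda j: sims[j], reverse=True)[:k]
        for rank, j in enumerate(top):
            # Assumption: edge weight = rank weight * similarity.
            adjacency[i][j] = rank_weights[rank] * sims[j]
        # L1 row-wise normalization so absolute edge weights sum to 1.
        row_sum = sum(abs(a) for a in adjacency[i])
        if row_sum > 0:
            adjacency[i] = [a / row_sum for a in adjacency[i]]
    return adjacency


def aggregate_with_residual(features, adjacency):
    """Mix each token with its graph neighbors, then add the input back."""
    dim = len(features[0])
    out = []
    for i, row in enumerate(adjacency):
        mixed = [sum(row[j] * features[j][d] for j in range(len(features)))
                 for d in range(dim)]
        out.append([x + m for x, m in zip(features[i], mixed)])
    return out


if __name__ == "__main__":
    # Four toy 2-D "token embeddings" forming two visually similar pairs.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    graph = similarity_adjacency(feats, k=2, rank_weights=polynomial_decay_weights(2))
    for row in graph:
        print([round(a, 3) for a in row])
    print(aggregate_with_residual(feats, graph))
```

In this toy input each token connects only to itself and its closest partner, so the aggregation mixes features within a pair and leaves the other pair untouched. In the module described by the caption, the graph would operate on bottlenecked ViT token embeddings, with the neighbor count and rank weights learned during adaptation.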