VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture
Long Li, Jiajia Li, Dong Chen, Lina Pu, Haibo Yao, Yanbo Huang
TL;DR
VLLFL tackles privacy-preserving, scalable object detection in smart agriculture by coupling a large vision-language model (VLM) with a lightweight, federated prompt generator. By freezing the base GroundingDINO model and training only a compact prompt generator, the framework reduces communication overhead by 99.3% while improving detection accuracy (e.g., global mAP rises from 9.59% to 24.12%, a gain of 14.53 percentage points). The approach supports cross-farm generalization, works across fruit and wildlife detection tasks, and remains effective whether or not the base model has been fine-tuned beforehand, though gains shrink when the base is already fully fine-tuned. This work demonstrates a practical path to deploying capable VLMs in privacy-sensitive agricultural settings, enabling scalable collaboration without raw data sharing, and points to multimodal prompting as a future enhancement to further boost performance.
Abstract
In modern smart agriculture, object detection plays a crucial role by enabling automation, precision farming, and resource monitoring. From identifying crop health and pest infestations to optimizing harvesting processes, accurate object detection enhances both productivity and sustainability. However, training object detection models often requires large-scale data collection and raises privacy concerns, particularly when sensitive agricultural data is distributed across farms. To address these challenges, we propose VLLFL, a vision-language model-based lightweight federated learning framework. It harnesses the generalization and context-aware detection capabilities of the vision-language model (VLM) and leverages the privacy-preserving nature of federated learning. By training a compact prompt generator to boost the performance of the VLM deployed across different farms, VLLFL preserves privacy while reducing communication overhead. Experimental results demonstrate that VLLFL improves VLM performance by 14.53% while reducing communication overhead by 99.3%. Spanning tasks from identifying a wide variety of fruits to detecting harmful animals in agriculture, the proposed framework offers an efficient, scalable, and privacy-preserving solution specifically tailored to agricultural applications.
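To make the communication pattern concrete, the following is a minimal sketch of one aggregation round in which each farm shares only its prompt-generator weights and the VLM stays local and frozen. It assumes a FedAvg-style weighted average, which this summary does not confirm as the authors' aggregation rule. The names `fedavg`, `communication_reduction`, and `prompt_proj`, as well as all parameter values and counts, are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List

# A client's shareable state: only the prompt-generator weights.
# The frozen VLM backbone is never included (assumption for this sketch).
Params = Dict[str, List[float]]


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighted by each client's local sample count."""
    total = sum(client_sizes)
    averaged: Params = {}
    for name in client_params[0]:
        dim = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(dim)
        ]
    return averaged


def communication_reduction(full_model_params: int, shared_params: int) -> float:
    """Fraction of per-round traffic saved by sending only the shared parameters."""
    return 1.0 - shared_params / full_model_params


# Two hypothetical farms with toy prompt-generator weights and local dataset sizes.
farm_a = {"prompt_proj": [1.0, 2.0]}
farm_b = {"prompt_proj": [3.0, 6.0]}
global_params = fedavg([farm_a, farm_b], [100, 300])

# Illustrative counts only (not the paper's actual model sizes):
# sharing 7 of every 1000 parameters saves 99.3% of the traffic.
saving = communication_reduction(full_model_params=1000, shared_params=7)
```

In this sketch the server broadcasts `global_params` back to every farm, where it is loaded into the local prompt generator before the next round. Raw images never leave the farm, and the per-round payload scales with the prompt generator rather than with the full detector.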
