CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward

Zhiqiang Wang; Pengbin Feng; Yanbin Lin; Shuzhang Cai; Zongao Bian; Jinghua Yan; Xingquan Zhu

TL;DR

This work targets crowd counting with vision-language models by replacing the binary 0/1 accuracy reward with a fuzzy, multi-component reward. It introduces Fuzzy Group Relative Policy Reward (FGRPR), which combines a format-adherence term with a precision term $r_p$, where $r_p = 1.5 - \frac{|\hat{y}-y|}{y}$ when the relative error $\frac{|\hat{y}-y|}{y}$ is under 0.5 (with $\hat{y}$ the predicted count and $y$ the ground-truth count), and uses Group Relative Policy Optimization (GRPO) to train the model without a separate value function. The resulting model, CrowdVLM-R1, outperforms strong baselines, including GPT-4o and LLaMA2 (90B), on five in-domain datasets and achieves out-of-domain results comparable to supervised fine-tuning (SFT); images with larger target counts benefit most from the fuzzy reward. The approach is demonstrated on a diverse counting dataset, and training under FGRPR shows sharper reward growth, suggesting that it applies to other estimation tasks where precise numerical outputs are critical.
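To make the precision term concrete, the following is a minimal Python sketch of $r_p$ next to a conventional 0/1 accuracy reward. The function names are hypothetical, and the value returned when the relative error is 0.5 or more (here 0.0) is an assumption, since the summary above only specifies the reward inside that range. The format-adherence term is omitted.

```python
def precision_reward(pred: float, target: float) -> float:
    """Fuzzy precision term r_p = 1.5 - |pred - target| / target.

    Assumption: returns 0.0 when the relative error is 0.5 or more;
    the summary does not state the out-of-range value.
    """
    if target <= 0:
        raise ValueError("target count must be positive")
    rel_err = abs(pred - target) / target
    if rel_err < 0.5:
        return 1.5 - rel_err
    return 0.0


def binary_reward(pred: float, target: float) -> float:
    """Conventional 0/1 accuracy reward: 1.0 only for an exact match."""
    return 1.0 if pred == target else 0.0


if __name__ == "__main__":
    target = 100
    for pred in (100, 90, 60, 40):
        print(pred, precision_reward(pred, target), binary_reward(pred, target))
```

Under this sketch, a near miss such as 90 against a target of 100 still earns most of the reward, whereas the binary reward treats it the same as a prediction of 40.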

Abstract

We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL (3B and 7B), surpasses all baseline models, including GPT-4o, LLaMA2 (90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
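One plausible way to see why a graded reward can help GRPO is to look at the group-relative advantage, which standardizes each sampled response's reward against the other responses drawn for the same prompt. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name, the use of the population standard deviation, the `eps` constant, and the reward values are all hypothetical. The fuzzy rewards correspond to predictions of 90, 120, 95, and 160 against a target of 100, scored with the precision term from the TL;DR and assuming 0.0 outside the 0.5 relative-error range.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, GRPO-style.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers, none exactly correct.
binary_rewards = [0.0, 0.0, 0.0, 0.0]   # 0/1 reward: every answer scores 0
fuzzy_rewards = [1.4, 1.3, 1.45, 0.0]   # graded reward: closer answers score higher

adv_binary = group_relative_advantages(binary_rewards)
adv_fuzzy = group_relative_advantages(fuzzy_rewards)

print("binary:", adv_binary)
print("fuzzy: ", adv_fuzzy)
```

In this example the binary group yields all-zero advantages, so it carries no preference among the four answers, while the fuzzy group ranks the closest prediction highest and the farthest lowest.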
