CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward

Zhiqiang Wang, Pengbin Feng, Yanbin Lin, Shuzhang Cai, Zongao Bian, Jinghua Yan, Xingquan Zhu

TL;DR

This work targets crowd counting with vision-language models by replacing the binary 0/1 accuracy reward with a fuzzy, multi-component reward. It introduces Fuzzy Group Relative Policy Reward (FGRPR), which combines a format-adherence term with a precision-based term $r_p$, where $r_p = 1.5 - \frac{|\hat{y}-y|}{y}$ when the relative error is under 0.5 ($\hat{y}$ is the predicted count and $y$ the ground-truth count), and uses Group Relative Policy Optimization (GRPO) to train the model without a separate value function. The method is instantiated in CrowdVLM-R1 and outperforms strong baselines, including GPT-4o and LLaMA2 (90B), on five in-domain datasets, while achieving competitive out-of-domain results; larger target counts benefit most from the fuzzy reward. The approach is demonstrated on a diverse counting dataset, where training under FGRPR shows sharper reward growth, suggesting that it applies more broadly to estimation tasks that require precise numerical outputs.
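The reward described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary only gives $r_p$ for relative errors under 0.5, so the value of 0 outside that range, the `<answer>…</answer>` output format, and the plain sum of the two terms are all hypothetical choices made here for illustration.

```python
import re

# ASSUMPTION: the expected output format is hypothetical; the summary only
# says that a format-adherence term exists, not what the format is.
ANSWER_RE = re.compile(r"<answer>\s*(\d+)\s*</answer>")


def precision_reward(pred: float, target: float, threshold: float = 0.5) -> float:
    """Fuzzy precision term r_p = 1.5 - |pred - target| / target."""
    if target <= 0:
        raise ValueError("target must be positive")
    rel_err = abs(pred - target) / target
    if rel_err < threshold:
        return 1.5 - rel_err
    # ASSUMPTION: the summary does not state the reward for larger errors;
    # zero is used here as a placeholder.
    return 0.0


def format_reward(text: str) -> float:
    """1.0 if the output contains a well-formed answer tag, else 0.0."""
    return 1.0 if ANSWER_RE.search(text) else 0.0


def total_reward(text: str, target: float) -> float:
    """ASSUMPTION: the two terms are combined by a plain sum."""
    match = ANSWER_RE.search(text)
    if match is None:
        return 0.0
    return format_reward(text) + precision_reward(float(match.group(1)), target)
```

Under these assumptions, a prediction of 90 for a true count of 100 earns a precision term of about 1.4, whereas a binary reward would score it the same as a prediction of 0.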

Abstract

We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL (3B and 7B), surpasses all baseline models, including GPT-4o, LLaMA2 (90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
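One way to see why a graded reward can help GRPO is to look at the group-relative advantage, which standardizes each sampled output's reward against the other outputs drawn for the same prompt. The sketch below is illustrative only: the function name, the use of the population standard deviation, the `eps` constant, and the example reward values are assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled outputs.

    ASSUMPTION: population standard deviation and eps are illustrative
    choices; implementations differ on these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four outputs, none of which is exactly right.
binary_rewards = [0.0, 0.0, 0.0, 0.0]  # 0/1 accuracy reward
fuzzy_rewards = [1.4, 1.2, 0.0, 1.1]   # graded reward, closer is higher

binary_adv = group_relative_advantages(binary_rewards)
fuzzy_adv = group_relative_advantages(fuzzy_rewards)
```

With the binary rewards every advantage is zero, so the group contributes no learning signal; with the graded rewards the closest estimate receives the largest advantage.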

Paper Structure

This paper contains 20 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An overview of our framework, which integrates group relative policy optimization (GRPO) and the proposed fuzzy group relative policy reward (FGRPR) to train a visual language model (VLM) for crowd counting tasks. From left to right, an image and prompt are input to a base LLM to obtain initial counting outcomes. (The ground-truth value in the top-left corner of the input image is only an indicator for visualization purposes. It is used only for reward calculation and is not sent to the LLM as part of the input.) The instruction model uses a policy model, parameterized by $\theta$, to generate outputs. The proposed fuzzy group relative policy reward is then used to adjust the reward for each individual image. The adjusted reward is used to calculate the objective function value, which in turn updates the $\theta$ parameters for supervised fine-tuning.
  • Figure 2: Examples of model outputs during the training stage, showing various counting strategies. In the left image, where the crowd is dense and difficult to count, the model segments the scene into six regions and estimates the number of individuals in each. In the middle image, with only a few people, the model counts them individually. In the right image, which has a moderate density of sheep, the model scans row by row to ensure each entity is counted once. These strategies resemble human approaches to counting in different scenarios.
  • Figure 3: A conceptual view of R1 model expansion to vision language model for crowd counting (CrowdVLM+R1) using fuzzy group relative policy reward. Multiple crowd-sourced inputs, with ground-truth counts, are used for supervised fine-tuning. The inputs are fed into a base model to generate outcomes, and the proposed fuzzy group relative policy reward (FGRPR) then determines the individual rewards and the system objective function value $\mathcal{J}_{GRPO}(\theta)$. Gradient propagation is then applied to update $\theta$, improving the base model's counting results.
  • Figure 4: Example images from the test dataset. Each column comes from the same original dataset; the first row shows an easier counting task and the second row a more complex one from the same source.
  • Figure 5: Training reward curves for different models show the same trend. For the precision reward, after smoothing, larger models outperform smaller models trained with the same method throughout training. The format reward curves show no difference after 100 steps.
  • ...and 1 more figures