The Role of Entropy in Visual Grounding: Analysis and Optimization

Shuo Li, Jiajun Sun, Zhihao Zhang, Xiaoran Fan, Senjie Jin, Hui Li, Yuming Yang, Junjie Ye, Lixing Shen, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This work examines how entropy behaves in perception-oriented visual grounding under reinforcement learning fine-tuning, revealing that high entropy persists because high-advantage responses have low probability and because of annotation noise. It introduces ECVGPO, an entropy-controlled extension of GRPO that uses self-information to reshape positive-sample advantages, thereby balancing exploration and exploitation with improved stability. Empirical results across multiple multimodal models and grounding benchmarks show consistent gains in accuracy, generalization, and training stability without extra computational cost. The findings offer a practical entropy-aware policy optimization strategy for perception tasks in multimodal systems.

Abstract

Recent advances in fine-tuning multimodal large language models (MLLMs) using reinforcement learning have achieved remarkable progress, particularly with the introduction of various entropy control techniques. However, the role and characteristics of entropy in perception-oriented tasks like visual grounding, as well as effective strategies for controlling it, remain largely unexplored. To address this issue, we focus on the visual grounding task and analyze the role and characteristics of entropy in comparison to reasoning tasks. Building on these findings, we introduce ECVGPO (Entropy Control Visual Grounding Policy Optimization), an interpretable algorithm designed for effective entropy regulation. Through entropy control, the trade-off between exploration and exploitation is better balanced. Experiments show that ECVGPO achieves broad improvements across various benchmarks and models.
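
For readers unfamiliar with the quantities named above, the following is a minimal sketch of the standard definitions of token entropy, self-information, and GRPO-style group-relative advantages. It is not the ECVGPO reshaping rule, which this summary does not specify. The function names are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption; implementations differ on both points.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def self_information(p):
    """Self-information (in nats) of a sampled token with probability p."""
    return -math.log(p)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group.

    Assumption: population standard deviation plus eps for numerical safety.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A uniform distribution over 4 tokens has the maximum entropy, log(4).
    print(token_entropy([0.25, 0.25, 0.25, 0.25]))
    # A low-probability token carries more self-information than a likely one.
    print(self_information(0.05), self_information(0.9))
    # Responses with above-average reward receive positive advantages.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Under these definitions, a correct response whose tokens have low probability has high self-information, which is the quantity the TL;DR says ECVGPO uses when adjusting positive-sample advantages.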

Paper Structure

This paper contains 42 sections, 26 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Comparison of entropy between reasoning and grounding tasks. Darker red indicates higher token entropy. In reasoning tasks, entropy rapidly decreases after training (entropy collapse), whereas in grounding tasks, it remains consistently high.
  • Figure 2: Word cloud of tokens with twice the average entropy.
  • Figure 3: Comparison of the reasoning-like and grounding-like rewards.
  • Figure 4: Comparison of the average probability of all tokens and numeric tokens in positive advantage samples during VLM-R1's training.
  • Figure 5: Comparison of the training time of GRPO and our method.
  • ...and 3 more figures