Reinforced Attention Learning

Bangzheng Li; Jianmo Ni; Chen Qu; Ian Miao; Liu Yang; Xingyu Fu; Muhao Chen; Derek Zhiyuan Cheng

TL;DR

This work proposes Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences, and introduces On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation.

Abstract

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
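The shift from "what to generate" to "where to attend" can be made concrete with a toy calculation. The sketch below is illustrative only and is not the paper's objective: it assumes GRPO-style group-normalized advantages and uses a KL divergence between each rollout's attention distribution and the current attention distribution, weighted by that rollout's advantage. The function names (`group_advantages`, `kl_divergence`, `advantage_weighted_attention_loss`) and the choice and direction of the divergence are hypothetical.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps) within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same attended positions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def advantage_weighted_attention_loss(rollout_attn, current_attn, rewards):
    """Assumed toy loss: mean over rollouts of advantage * KL(rollout || current).

    Minimizing it pulls the current attention toward rollouts with positive
    advantage and pushes it away from rollouts with negative advantage.
    """
    advantages = group_advantages(rewards)
    terms = [
        a * kl_divergence(p, q)
        for a, p, q in zip(advantages, rollout_attn, current_attn)
    ]
    return sum(terms) / len(terms)


# Two rollouts over three context positions; the first one earned the reward.
rollout_attn = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
current_attn = [[0.4, 0.3, 0.3], [0.4, 0.3, 0.3]]
rewards = [1.0, 0.0]
loss = advantage_weighted_attention_loss(rollout_attn, current_attn, rewards)
```

In a real training setup the current attention would come from a differentiable forward pass so that this scalar could be backpropagated; the pure-Python version here only evaluates the quantity.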

Paper Structure (35 sections, 11 equations, 3 figures, 3 tables)

Introduction
Related Works
Post training LLMs through Reinforcement Learning
Distilling knowledge and beyond from teacher to student models
Reinforced Attention Learning
Aggregated causal Attention Distribution Policy
Advantage-Weighted Attention Divergence
Combined Optimization Objective
Gradient Derivation
Gradient w.r.t. Distribution.
Gradient w.r.t. Logits.
Total Parameter Update.
On-Policy Attention Distillation
Teacher-Student Alignment.
Unified Distillation Objective.
...and 20 more sections

Figures (3)

Figure 1: Reinforced Attention Learning formulates internal attention distributions as a policy. Unlike traditional methods that optimize next-token probabilities ("what to generate"), our approach prioritizes the selective allocation of information ("where to focus"). By optimizing for the advantage, the model explores a high-reward attention policy that effectively isolates salient information from dense contexts.
Figure 2: Sample data of the SFT and RL training stages. The SFT stage adapts the model to a "think-and-answer" paradigm, while the RL stage employs a reward function to verify the format and correctness of the rollout responses.
Figure 3: RAL improves over GRPO as the number of video frames or the image resolution increases.
