Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models

Michael Jungo, Andreas Fischer

TL;DR

This paper investigates rule-based reinforcement learning (RL) for document image classification with Vision-Language Models (VLMs), using verifiable rewards with GRPO and RLVR as an alternative to supervised fine-tuning (SFT). Using LoRA adapters with $4$-bit quantization (QLoRA) and a group sampling size of $G=8$, the authors evaluate on RVL-CDIP and a born-digital variant, finding that RL improves generalisation to out-of-distribution data and unseen classes while facing challenges in cross-modality transfer. They also analyze the role of reasoning traces, showing that including reasoning improves test-time performance but can complicate formatting and training stability. Overall, the work suggests RL is a viable option for verifiable downstream tasks with potential explainability benefits, albeit at a cost in training efficiency and stability, both of which need further research.
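To make the idea of rule-based, verifiable rewards with group sampling more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a DeepSeek-R1-style response format with `<think>` and `<answer>` tags, reward values of 0 or 1, and a population standard deviation for normalisation; the function names and these choices are hypothetical and only illustrate how a group of $G$ sampled responses could be scored and turned into group-relative advantages.

```python
import re
from statistics import mean, pstdev

# Assumed response format (hypothetical, modelled on DeepSeek-R1-style prompts):
#   <think>reasoning</think><answer>class label</answer>
RESPONSE_PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed tag format, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response) else 0.0


def accuracy_reward(response: str, label: str) -> float:
    """Return 1.0 if the extracted answer equals the ground-truth label."""
    match = RESPONSE_PATTERN.match(response)
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == label.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise rewards within one group of sampled responses.

    Each reward is centred on the group mean and scaled by the group
    standard deviation (population std is an assumption of this sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(responses: list[str], label: str) -> list[float]:
    """Sum format and accuracy rewards, then compute group-relative advantages."""
    rewards = [format_reward(r) + accuracy_reward(r, label) for r in responses]
    return group_relative_advantages(rewards)
```

With $G=8$, `score_group` would receive eight responses sampled for the same document image. Responses that are both well formatted and correct receive a positive advantage relative to the rest of the group, and a group in which all responses score the same yields advantages of zero.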

Abstract

Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from the emerging properties of reinforcement learning, particularly the enhanced reasoning capabilities. We study the effects of rule-based reinforcement learning on the task of Document Image Classification, which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distribution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at https://github.com/jungomi/vision-finetune.

Paper Structure

This paper contains 21 sections, 3 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Reasoning Examples. Predictions of Llama-3.2-11B-Vision-Instruct trained with reinforcement learning (RL), shown together with the reasoning the model provided in its response before giving its final answer.