Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Minheng Ni; Zhengyuan Yang; Linjie Li; Chung-Ching Lin; Kevin Lin; Wangmeng Zuo; Lijuan Wang

TL;DR

Point-RFT tackles multimodal reasoning by grounding each chain-of-thought step in visual elements. It introduces a two-stage training pipeline: Point-CoT format finetuning on a 71K-sample visual reasoning dataset, followed by reinforcement finetuning with GRPO on ChartQA. The resulting model reaches 90.04% accuracy on ChartQA and generalizes well to out-of-domain benchmarks. Explicit visual grounding improves interpretability and perception-reasoning alignment, a claim backed by extensive ablations and qualitative analyses. The work also provides the 71K-sample Point-CoT dataset to support future research in grounded multimodal reasoning, and it demonstrates practical gains on complex visual document understanding tasks.
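The summary names GRPO as the reinforcement finetuning algorithm but does not spell out its mechanics. As a minimal sketch of the general technique (not the authors' implementation), GRPO samples a group of responses per prompt and scores each one relative to the group's mean reward. The function name, the binary reward values, and the `eps` constant below are assumptions for illustration, and implementations differ on whether they use the population or sample standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group (GRPO-style).

    `rewards` holds one scalar per sampled response to the same prompt.
    `eps` is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four sampled answers: 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.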

Abstract

Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations of text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages. First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to the corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning that relies solely on text-based CoT. This result shows that grounded CoT is more effective for multimodal reasoning than text-only CoT. Moreover, Point-RFT exhibits superior generalization across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, and TabMWP, highlighting its potential in complex real-world scenarios.
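The ChartQA comparison above can be made concrete with a few lines of arithmetic. The three accuracy figures are taken from the abstract; the variable names and the derived error-rate reduction are illustrative additions, not numbers reported by the authors.

```python
# ChartQA accuracies (%) as stated in the abstract.
format_finetuned = 70.88   # after Point-CoT format finetuning only
text_cot_rft = 83.92       # reinforcement finetuning with text-only CoT
point_rft = 90.04          # reinforcement finetuning with grounded CoT

# Absolute gains in percentage points.
gain_over_format = round(point_rft - format_finetuned, 2)
gain_over_text = round(point_rft - text_cot_rft, 2)

# Derived figure (not in the source): fraction of the baseline's
# errors that are eliminated by reinforcement finetuning.
error_reduction = (point_rft - format_finetuned) / (100 - format_finetuned)

print(gain_over_format, gain_over_text)
```

Under these figures, grounded reinforcement finetuning adds 19.16 points over the format-finetuned baseline and 6.12 points over the text-only CoT variant.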
