Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, Lijuan Wang

TL;DR

Point-RFT tackles the challenge of multimodal reasoning by grounding chain-of-thought in visual elements. It introduces a two-stage training pipeline: Point-CoT format finetuning on a 71K visual reasoning dataset and reinforcement finetuning with GRPO on ChartQA, achieving a ChartQA accuracy of 90.04% and strong out-of-domain generalization. The approach emphasizes interpretability and perception-reasoning alignment through explicit visual grounding, backed by extensive ablations and qualitative analyses. The work also provides a 71K-sample Point-CoT dataset to facilitate future research in grounded multimodal reasoning and demonstrates practical improvements for complex visual document understanding tasks.

Abstract

Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations in text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages. First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. This result shows that grounded CoT is more effective for multimodal reasoning than text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, and TabMWP, highlighting its potential in complex real-world scenarios.
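The ChartQA numbers in the abstract can be restated as gains in percentage points and as a relative reduction in error rate. The short sketch below does only that arithmetic on the three reported accuracies; the variable and function names are illustrative and do not come from the paper.

```python
# ChartQA accuracies (percent) exactly as quoted in the abstract.
format_finetuned = 70.88   # format-finetuned baseline
text_cot_rft = 83.92       # reinforcement finetuning with text-only CoT
point_rft = 90.04          # Point-RFT (grounded CoT + reinforcement finetuning)


def point_gain(new_acc: float, old_acc: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(new_acc - old_acc, 2)


def relative_error_reduction(new_acc: float, old_acc: float) -> float:
    """Fraction of the old error rate (100 - accuracy) that was removed."""
    old_err = 100.0 - old_acc
    new_err = 100.0 - new_acc
    return (old_err - new_err) / old_err


gain_over_baseline = point_gain(point_rft, format_finetuned)
gain_over_text_cot = point_gain(point_rft, text_cot_rft)
err_cut_vs_text_cot = relative_error_reduction(point_rft, text_cot_rft)

print(f"vs. format-finetuned baseline: +{gain_over_baseline} points")
print(f"vs. text-only CoT RFT:         +{gain_over_text_cot} points")
print(f"error reduction vs. text-only CoT RFT: {err_cut_vs_text_cot:.1%}")
```

Reporting the error reduction alongside the point gain is a presentation choice made here, not one made in the paper; it makes gains near the top of the accuracy range easier to compare.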

Paper Structure

This paper contains 22 sections, 2 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Point-RFT improves multimodal reinforcement finetuning with visually grounded CoT. (1) Supervised Format Finetuning (SFT): we first construct a Point-CoT dataset, which allows the model to generate step-by-step reasoning traces explicitly linked to visual pointing, mitigating hallucinations and enhancing perception-reasoning alignment. (2) Reinforcement Finetuning (RFT) with GRPO: this stage optimizes answer correctness and grounded rationale coherence by rewarding localized visual-textual reasoning paths.
  • Figure 2: Visualization of the Point-CoT dataset. The dataset integrates the reasoning process of answering questions with point grounding, creating a novel form of multimodal CoT.
  • Figure 3: Overall dataset generation pipeline. The construction process combines LLM reasoning (GPT-4o) with geometric grounding (Molmo-7B). The pipeline ensures spatial-textual consistency through cross-validation, producing our Point-CoT dataset.
  • Figure 4: Cases of In-domain Chart. These cases highlight the limitations of pure text reasoning in analyzing complex visual elements. Point-RFT successfully reasons by integrating the visual content.
  • Figure 5: Case of OOD Chart. Point-RFT successfully transfers coordinate referencing skills learned from bar charts.
  • ...and 2 more figures
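Figure 1 names GRPO as the reinforcement finetuning algorithm. As GRPO is commonly described, several responses are sampled for the same prompt and each one's reward is normalized against the group's mean and standard deviation to form its advantage. The sketch below illustrates only that normalization step. The reward function `toy_reward`, its 1.0/0.5 weights, and the `has_points` flag are hypothetical stand-ins, not the reward used in the paper, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def toy_reward(answer: str, gold: str, has_points: bool) -> float:
    """Hypothetical reward: 1.0 for a correct answer plus 0.5 if the
    response contains point-grounded reasoning. Weights are made up."""
    correct = 1.0 if answer.strip() == gold.strip() else 0.0
    grounded = 0.5 if has_points else 0.0
    return correct + grounded


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group.

    A response scores positive only if it beats the group average, so no
    separate value network is needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one chart question whose gold answer is "42".
samples = [("42", True), ("41", True), ("42", False), ("7", False)]
rewards = [toy_reward(ans, "42", pts) for ans, pts in samples]
advantages = group_relative_advantages(rewards)

for (ans, pts), r, a in zip(samples, rewards, advantages):
    print(f"answer={ans!r:5} grounded={pts!s:5} reward={r:.1f} advantage={a:+.2f}")
```

If every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is why the group needs a mix of successes and failures to be useful.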