Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

Zhe Hu, Jing Li, Zhongzhu Pu, Hou Pong Chan, Yu Yin

TL;DR

The paper tackles how vision-language models can achieve robust vision-grounded decision-making by decoupling reasoning from perception and learning it from language alone. It introduces Praxis-VLM, which trains a reasoning policy on a text-only decision-making corpus using Group Relative Policy Optimization with a multi-stage adaptive reward, then transfers the learned reasoning to multimodal inference. Empirical results across VIVA, PCA-Bench, and EgoNormia show Praxis-VLM outperforms vanilla VLMs and SFT baselines, with strong generalization to out-of-domain scenarios and clear explicit-reasoning behavior. This work offers a data-efficient pathway for constructing generalizable decision-making capabilities in embodied AI, leveraging language to instill transferable reasoning that can be grounded in vision at inference time.

Abstract

Vision Language Models exhibit impressive performance for various tasks, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions, suggesting foundational reasoning can be effectively learned from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning, exhibiting superior performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, underpinning their enhanced performance and adaptability.
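
The abstract names GRPO without showing its core step, so here is a minimal sketch of the group-relative advantage that gives the algorithm its name. It assumes the commonly used formulation in which each sampled response's reward is normalized by the mean and standard deviation of its own group. The reward values, the function name `group_relative_advantages`, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper; the paper's multi-stage adaptive reward is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumption) guards against division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one textual scenario.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and are reinforced, while below-average responses are penalized, so no separate value network is needed to provide a baseline.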

Paper Structure

This paper contains 24 sections, 1 equation, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Illustrative examples of Praxis-VLM's decision-making process. Employing text-driven training, Praxis-VLM performs sophisticated reasoning by analyzing visual situations, posing relevant questions, and generating reasoned textual responses to support multimodal decision-making.
  • Figure 2: Model accuracy on VIVA (Hu et al., 2024) and PCA-Bench (Chen et al., 2024). Image Situation uses the original image as input, and Text Situation employs the caption (text) instead.
  • Figure 3: Overview of Praxis-VLM: Learning transferable reasoning from text-only data for multimodal decision-making. The process involves (1) constructing synthetic text-based training data where situations are represented through textual descriptions, (2) training the VLM on this data using RL with adaptive rewards to develop reasoning skills, and (3) transferring the learned reasoning to vision-grounded decision-making tasks during inference.
  • Figure 4: Accuracy versus reasoning length on VIVA and EgoNormia. Samples are grouped into five quintile bins by the length of the reasoning generated by Praxis-VLM (Len1: shortest 20%, Len5: longest 20%).
  • Figure 5: Dominant reasoning dimensions used by Praxis-VLM in decision-making. Clusters were identified by analyzing keyphrases generated by GPT-4o from the model's reasoning chains.
  • ...and 7 more figures
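
The grouping described in the Figure 4 caption can be sketched as follows: samples are ranked by reasoning length, split into five equal-sized bins (Len1 to Len5), and accuracy is computed per bin. This is a minimal illustration under stated assumptions, not the authors' analysis code. The function names, the rank-based tie handling, and the toy `lengths` and `correct` values are all hypothetical and do not reflect results from the paper.

```python
def quintile_bins(lengths):
    """Assign each sample a bin from 1 (shortest 20%) to 5 (longest 20%).

    Binning is by rank, so ties are broken by input order (an assumption).
    """
    n = len(lengths)
    order = sorted(range(n), key=lambda i: lengths[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // n + 1
    return bins


def accuracy_per_bin(lengths, correct):
    """Return accuracy per bin, keyed Len1..Len5; empty bins are omitted."""
    totals = {b: [0, 0] for b in range(1, 6)}  # bin -> [hits, count]
    for b, c in zip(quintile_bins(lengths), correct):
        totals[b][0] += int(c)
        totals[b][1] += 1
    return {f"Len{b}": hits / count
            for b, (hits, count) in totals.items() if count}


# Made-up reasoning lengths (in tokens) and correctness flags for ten samples.
lengths = [120, 45, 300, 80, 210, 60, 150, 95, 260, 30]
correct = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
bins = quintile_bins(lengths)
acc = accuracy_per_bin(lengths, correct)
```

Rank-based binning keeps the five groups the same size even when the length distribution is skewed, which matches the caption's "shortest 20%" and "longest 20%" wording.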