Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning
Zhe Hu, Jing Li, Zhongzhu Pu, Hou Pong Chan, Yu Yin
TL;DR
The paper asks how vision-language models (VLMs) can make robust vision-grounded decisions, and answers by decoupling reasoning from perception and learning the reasoning from language alone. It introduces Praxis-VLM, which trains a reasoning policy on a text-only decision-making corpus using Group Relative Policy Optimization (GRPO) with a multi-stage adaptive reward, then transfers the learned reasoning to multimodal inference. Across VIVA, PCA-Bench, and EgoNormia, Praxis-VLM outperforms vanilla VLMs and supervised fine-tuning (SFT) baselines, generalizes well to out-of-domain scenarios, and produces explicit reasoning. The work offers a data-efficient path to generalizable decision-making in embodied AI: language is used to instill transferable reasoning that is then grounded in vision at inference time.
Abstract
Vision-Language Models (VLMs) exhibit impressive performance on a variety of tasks, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions, suggesting that foundational reasoning can be learned effectively from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM applies the Group Relative Policy Optimization (GRPO) algorithm to textual scenarios to instill robust reasoning capabilities, so that models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, transfer successfully to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning in both performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, which underpins their enhanced performance and adaptability.
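To make the GRPO step more concrete, the following is a minimal sketch of the group-relative advantage that gives the algorithm its name: several responses are sampled for the same textual scenario, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The function name, the epsilon value, the use of the population standard deviation, and the example rewards are all illustrative assumptions; the abstract does not specify the reward design or training details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per response sampled for the same prompt.
    The population standard deviation and `eps` are assumptions of this
    sketch, not details taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses better than the group average get a positive advantage,
    # worse ones a negative advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one textual scenario
# (e.g. 1.0 if the chosen action is correct, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, a group whose responses all receive the same reward yields zero advantage for every response, so such a group contributes no learning signal.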
