GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

Haoqiang Kang; Enna Sachdeva; Piyush Gupta; Sangjae Bae; Kwonjoon Lee

TL;DR

GFlowVLM introduces a Generative Flow Network (GFlowNet)–based fine-tuning framework for vision-language models to improve multi-step reasoning in multimodal environments. By treating reasoning as non-Markovian trajectories and guiding action selection with chain-of-thought augmented prompts, the method promotes diverse, high-reward reasoning paths and leverages off-policy data for efficiency. The approach outperforms SFT and PPO-based baselines across NumberLine, BlackJack, and ALFWorld in task success, solution diversity, and generalization to out-of-distribution (OOD) scenarios, while also learning faster. The work demonstrates the value of sampling trajectories in proportion to their rewards to explore a richer space of reasoning strategies, with implications for embodied AI and complex planning tasks.

Abstract

Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
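
The contrast between maximizing reward and sampling in proportion to reward can be made concrete with a toy sketch. This is an illustration only, not the paper's training procedure: the three trajectory names and their reward values are hypothetical, and the target distribution is computed directly rather than learned by a GFlowNet.

```python
import random
from collections import Counter

def reward_proportional_probs(rewards):
    """Target distribution of a GFlowNet-style sampler: P(traj) = R(traj) / sum of all R."""
    total = sum(rewards.values())
    return {traj: r / total for traj, r in rewards.items()}

def sample_trajectories(rewards, n, seed=0):
    """Draw n trajectories with probability proportional to their reward."""
    rng = random.Random(seed)
    trajs = list(rewards)
    weights = [rewards[t] for t in trajs]
    return Counter(rng.choices(trajs, weights=weights, k=n))

# Hypothetical rewards for three complete reasoning trajectories.
rewards = {"path_a": 4.0, "path_b": 3.0, "path_c": 1.0}

# A pure reward maximizer collapses onto the single best trajectory.
greedy_choice = max(rewards, key=rewards.get)

# A reward-proportional sampler keeps visiting every rewarded trajectory.
target = reward_proportional_probs(rewards)
counts = sample_trajectories(rewards, n=1000)
```

Here `greedy_choice` is always `path_a`, whereas `target` assigns `path_b` and `path_c` nonzero probability, which is the sense in which reward-proportional sampling preserves solution diversity.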
