GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

Haoqiang Kang, Enna Sachdeva, Piyush Gupta, Sangjae Bae, Kwonjoon Lee

TL;DR

GFlowVLM introduces a Generative Flow Network (GFlowNet)-based fine-tuning framework for vision-language models to improve multi-step reasoning in multimodal environments. By treating reasoning as non-Markovian trajectories and guiding action selection with chain-of-thought augmented prompts, the method promotes diverse, high-reward reasoning paths and leverages off-policy data for efficiency. The approach outperforms SFT and PPO-based baselines across NumberLine, BlackJack, and ALFWorld in terms of success rate, solution diversity, and generalization to out-of-distribution (OOD) scenarios, while also learning faster. The work demonstrates the value of sampling trajectories in proportion to their rewards to explore a richer space of reasoning strategies, with implications for embodied AI and complex planning tasks.

Abstract

Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
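
The contrast between maximizing reward and sampling in proportion to reward can be illustrated with a small sketch. It is not the paper's training procedure: it samples directly from the reward-proportional target distribution that a GFlowNet-style policy is trained to approximate. The outcome names and reward values are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical terminal outcomes and rewards (illustrative, not from the paper).
REWARDS = {"path_a": 4.0, "path_b": 3.0, "path_c": 3.0, "path_d": 0.0}


def greedy_choice(rewards):
    """Reward maximization: always return the single highest-reward outcome."""
    return max(rewards, key=rewards.get)


def target_distribution(rewards):
    """Reward-proportional target: p(x) = R(x) / sum of all rewards."""
    total = sum(rewards.values())
    return {x: r / total for x, r in rewards.items()}


def sample_proportional(rewards, n, seed=0):
    """Draw n outcomes with probability proportional to their reward."""
    rng = random.Random(seed)
    outcomes = list(rewards)
    weights = [rewards[x] for x in outcomes]
    return Counter(rng.choices(outcomes, weights=weights, k=n))


if __name__ == "__main__":
    print("greedy:", greedy_choice(REWARDS))
    print("target:", target_distribution(REWARDS))
    print("samples:", sample_proportional(REWARDS, n=1000))
```

The greedy rule collapses onto one outcome, whereas proportional sampling keeps visiting every outcome with non-zero reward, which is the sense in which reward-proportional sampling yields more diverse solutions.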

Paper Structure

This paper contains 54 sections, 18 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of diverse sequence prediction using GFlowNets as compared to PPO. The model takes an image of the sequence and a prompt as input, and generates the next number of the sequence by implicitly modeling the causality. See Figure 4 for a practical example.
  • Figure 2: Overall framework of the proposed method. The input $z_{0:t}$ at time step $t$ consists of a visual observation $o_t$ and an input prompt $p_t$ containing the goal description, history states $s_{0:t}$, history actions $a_{0:t}$, and admissible actions $\mathcal{A}_{t}$; the model outputs CoT reasoning $c_t$ and an action $a_t$. The action $a_t$ is executed in the environment to obtain the reward $r_t(s_t, a_t)$, the next observation $o_{t+1}$, and the next action space $\mathcal{A}_{t+1}$. $f$ generates the next prompt $p_{t+1}$ using a description of the next observation $o_{t+1}$ (if applicable), the history of states $s_{0:t}$ and actions $a_{0:t}$, and the next admissible actions $\mathcal{A}_{t+1}$. Repeating this process generates multiple trajectories. The transitions $\langle s_t, a_t, r_t, c_t \rangle$, $\langle s_t', a_t', r_t', c_t' \rangle$, and $\langle s_t'', a_t'', r_t'', c_t'' \rangle$ across different trajectories are added to a buffer to update the forward policy $P_{F}$ using GFlowNets. $x, x', x'' \in \mathcal{X}$ represent the terminal states of sequences. $R(x)$ represents the non-negative reward obtained from the environment (after reward shaping, if applicable) at the terminal state $x$ of a trajectory.
  • Figure 3: Training curves showing in-distribution episode success rates ($\%$) across three tasks. For NumberLine and BlackJack, RL4VLM is trained with the original reward, while GFlowVLM variants use a revised reward function, as RL4VLM serves as a strong baseline under original rewards. In ALFWorld, all methods use the same (original) reward without revision. Models are trained using on-policy sampling.
  • Figure 4: Overview of diverse sequence prediction using GFlowVLM as compared to PPO for ALFWorld scenarios. The model takes an image of the sequence and a prompt as input, and generates the next number of the sequence by implicitly modeling the causality.
  • Figure 5: Average success rates ($\%$) of our method under different values of the CoT weighting factor $\lambda$ on NumberLine across three loss functions.
  • ...and 2 more figures
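
The data-collection loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyNumberLine`, `build_prompt`, `stub_policy`, `rollout`, and `collect` are hypothetical names, a text string stands in for the visual observation $o_t$, a random stub stands in for the VLM, and the GFlowNet update of $P_F$ is omitted entirely.

```python
import random


class ToyNumberLine:
    """Hypothetical stand-in environment: move `current` to `target` with + / -."""

    def __init__(self, current=0, target=3, max_steps=6):
        self.current, self.target, self.max_steps = current, target, max_steps
        self.steps = 0

    def observe(self):
        # A text string stands in for the visual observation o_t.
        return f"current={self.current}, target={self.target}"

    def admissible_actions(self):
        return ["+", "-"]

    def step(self, action):
        self.current += 1 if action == "+" else -1
        self.steps += 1
        done = self.current == self.target or self.steps >= self.max_steps
        reward = 1.0 if self.current == self.target else 0.0
        return reward, done


def build_prompt(goal, states, actions, admissible):
    """Plays the role of f: builds the prompt from history and admissible actions."""
    return (f"Goal: {goal}\nStates so far: {states}\n"
            f"Actions so far: {actions}\nAdmissible: {admissible}")


def stub_policy(observation, prompt, admissible, rng):
    """Placeholder for the VLM: returns a (CoT text, action) pair."""
    action = rng.choice(admissible)
    cot = f"Given {observation}, I try '{action}'."
    return cot, action


def rollout(env, rng):
    """Run one trajectory and return its (state, action, reward, CoT) transitions."""
    states, actions, transitions = [], [], []
    done = False
    while not done:
        obs = env.observe()
        admissible = env.admissible_actions()
        prompt = build_prompt("reach the target", states, actions, admissible)
        cot, action = stub_policy(obs, prompt, admissible, rng)
        reward, done = env.step(action)
        transitions.append((obs, action, reward, cot))
        states.append(obs)
        actions.append(action)
    return transitions


def collect(num_trajectories=3, seed=0):
    """Gather transitions from several trajectories into one buffer."""
    rng = random.Random(seed)
    buffer = []
    for _ in range(num_trajectories):
        buffer.extend(rollout(ToyNumberLine(), rng))
    return buffer


if __name__ == "__main__":
    for transition in collect():
        print(transition)
```

In a real setup the buffer would feed a GFlowNet objective that updates the forward policy; here it only demonstrates how transitions from several trajectories accumulate.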