Do GFlowNets Transfer? Case Study on the Game of 24/42
Adesh Gupta, Abhinav Kumar, Mansi Gupta, Paras Chopra
TL;DR
This study examines whether GFlowNets trained to generate diverse, high-reward solutions in a reasoning game can transfer to a closely related task. It fine-tunes small and medium-sized LLMs (e.g., LLaMA variants) with a GFlowNet objective on the Game of 24, then evaluates them on the Game of 42 under varied decoding strategies and temperatures, treating accuracy and diversity as separate signals. The findings show strong in-distribution gains on the Game of 24 but transfer to the Game of 42 only under particular hyperparameter configurations, indicating limited zero-shot transfer and high sensitivity to hyperparameters. The work highlights the need for improved transfer learning mechanisms and larger-scale evaluations to achieve robust cross-task diversity in LLMs.
Abstract
Generating diverse solutions is key to human-like reasoning, yet autoregressive language models focus on single accurate responses, limiting creativity. GFlowNets frame solution generation as a flow network, promising greater diversity. Our case study demonstrates their limited zero-shot transferability by fine-tuning small and medium-sized large language models on the Game of 24 and testing them on the Game of 42. The results show that GFlowNets struggle to maintain solution diversity and accuracy on the new task, highlighting key limitations in their cross-task generalization and the need for future research on improved transfer learning.
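To make the task and the two signals concrete, the following is a minimal Python sketch of how a sampled solution to the Game of 24 or 42 could be checked, and how accuracy and diversity could be counted separately over a batch of samples. It is an illustration only, not the evaluation code used in this study: the function names (is_valid_solution, score_samples), the rule that each input number is used exactly once with +, -, *, and /, and the definition of diversity as the number of distinct correct expressions after whitespace normalization are all assumptions made here.

```python
import ast
from collections import Counter
from fractions import Fraction

# Binary operators allowed in a solution (assumed rule set: + - * /).
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an arithmetic AST exactly, recording every integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def is_valid_solution(expr, numbers, target):
    """Hypothetical checker: expr must use each number once and equal target."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target


def score_samples(samples, numbers, target):
    """Return (accuracy, number of distinct correct expressions).

    Diversity here is an assumed proxy: distinct correct strings after
    removing whitespace, so algebraically equivalent rewrites still count
    as different solutions.
    """
    correct = [s for s in samples if is_valid_solution(s, numbers, target)]
    distinct = {"".join(s.split()) for s in correct}
    accuracy = len(correct) / len(samples) if samples else 0.0
    return accuracy, len(distinct)
```

For example, score_samples(["(7 - 8 / 8) * 4", "(7-8/8)*4", "4 * (7 - 8 / 8)", "8 + 8 + 7 + 4"], [4, 7, 8, 8], 24) counts three correct samples but only two distinct ones, which is why the two signals are worth reporting separately. The same checker applies to the Game of 42 by changing the target, e.g. is_valid_solution("6 * 7 * 1 * 1", [6, 7, 1, 1], 42).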
