RL-CFR: Improving Action Abstraction for Imperfect Information Extensive-Form Games with Reinforcement Learning

Boning Li, Zhixuan Fang, Longbo Huang

Abstract

Effective action abstraction is crucial for handling the large action spaces of Imperfect Information Extensive-Form Games (IIEFGs). However, because of the vast state space and computational complexity of IIEFGs, existing methods often rely on fixed abstractions, which results in sub-optimal performance. In response, we introduce RL-CFR, a novel reinforcement learning (RL) approach for dynamic action abstraction. RL-CFR builds upon our Markov Decision Process (MDP) formulation, in which states correspond to public information and actions are feature vectors that each specify an action abstraction. The reward is defined as the expected payoff difference between the selected and default action abstractions. RL-CFR constructs a game tree with RL-guided action abstractions and uses counterfactual regret minimization (CFR) for strategy derivation. It can be trained from scratch and achieves a higher expected payoff without increasing CFR solving time. In experiments on Heads-up No-limit Texas Hold'em, RL-CFR outperforms a replication of ReBeL and Slumbot, with win-rate margins of $64\pm 11$ and $84\pm 17$ mbb/hand, respectively.
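
As a rough formalization of the reward described in the abstract, one might write it as follows. The notation is an assumption made here for illustration and is not necessarily the paper's: $\beta$ denotes the public state, $a$ the selected action abstraction, $a_{\text{default}}$ the default abstraction, and $v(\beta; a)$ the expected payoff of the acting player when the game tree at $\beta$ is built with abstraction $a$ and solved with CFR.

$$ r(\beta, a) = v(\beta; a) - v(\beta; a_{\text{default}}) $$

Under this idealized reading, selecting the default abstraction yields zero reward, and a positive reward indicates that the selected abstraction improved on the default.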

Paper Structure (13 sections, 4 equations, 3 figures, 2 tables, 2 algorithms)

Figures (3)

  • Figure 1: The game starts with a chance state where Player $1$ faces an equal probability of receiving either $J$ or $K$, and Player $2$ is consistently dealt $Q$. If Player $1$ is dealt $K$, an automatic all-in is initiated. When Player $1$ holds $J$, a pivotal decision arises between opting for a cautious check or committing to an all-in strategy. Subsequently, if Player $1$ opts for an all-in move, Player $2$ is confronted with the dilemma of whether to fold or call. Importantly, Player $2$ remains uninformed about the specific cards held by Player $1$, resulting in an information set that encompasses two states. Upon reaching the terminal state, payoffs are assigned to both players in accordance with a predefined assignment rule.
  • Figure 2: Training procedure for the RL-CFR framework. The labels in the figure correspond to the sampling steps of the RL-CFR framework. A sampling epoch starts from the initial PBS $\beta_{\text{init}}$.
  • Figure 3: This figure illustrates the process of generating PBS data and training the PBS value network. For a given PBS $\beta$, we construct a depth-limited subgame rooted at $\beta$. Nodes that are neither terminal nor leaf nodes are depicted as circles; during the construction of the game tree, we expand their child nodes based on the action abstraction of the PBS associated with each node. Terminal nodes, denoted by diamonds, allow for direct calculation of the PBS value. Leaf nodes, represented by rectangles, require an estimate of the PBS value in every iteration of CFR, and the PBS value network provides these estimates.
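
The toy game in the Figure 1 caption is small enough to solve directly, which may help make the CFR step concrete. The sketch below runs plain (vanilla) CFR with regret matching on that game; it is not RL-CFR and is not the authors' code. The caption only mentions "a predefined assignment rule" for payoffs, so the payoff rule here is an assumption: each player posts an ante, an all-in adds the remaining stack, cards rank $K > Q > J$, and a check goes straight to showdown. The names `solve_toy_game`, `ante`, and `stack` are hypothetical.

```python
def regret_matching(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    if total > 0:
        return [p / total for p in positives]
    return [1.0 / len(regrets)] * len(regrets)


def solve_toy_game(ante=1.0, stack=3.0, iterations=100_000):
    """Vanilla CFR on the Figure 1 game under ASSUMED payoffs.

    Returns the average probability that Player 1 goes all-in with J
    and the average probability that Player 2 calls an all-in.
    """
    regret_p1 = [0.0, 0.0]  # Player 1 holding J: [check, all-in]
    regret_p2 = [0.0, 0.0]  # Player 2 facing all-in: [fold, call]
    sum_p1 = [0.0, 0.0]     # accumulated strategies for averaging
    sum_p2 = [0.0, 0.0]
    chance = 0.5            # Player 1 is dealt J or K with equal probability
    showdown = ante + stack  # amount won or lost when an all-in is called

    for _ in range(iterations):
        check, allin = s1 = regret_matching(regret_p1)
        fold, call = s2 = regret_matching(regret_p2)

        # Player 1 with J: checking loses the ante to Q (assumed rule);
        # an all-in wins the ante on a fold and loses the showdown on a call.
        v1 = [-ante, fold * ante - call * showdown]
        node1 = check * v1[0] + allin * v1[1]
        for a in range(2):
            regret_p1[a] += chance * (v1[a] - node1)
            sum_p1[a] += s1[a]

        # Player 2's information set holds two states: (K, all-in) and
        # (J, all-in). Weight each by chance and Player 1's reach probability.
        reach_k = chance          # K always goes all-in
        reach_j = chance * allin
        v2 = [-(reach_k + reach_j) * ante, (reach_j - reach_k) * showdown]
        node2 = fold * v2[0] + call * v2[1]
        for a in range(2):
            regret_p2[a] += v2[a] - node2
            sum_p2[a] += s2[a]

    bluff_freq = sum_p1[1] / sum(sum_p1)
    call_freq = sum_p2[1] / sum(sum_p2)
    return bluff_freq, call_freq
```

Under these assumed payoffs, the usual indifference argument gives an equilibrium bluffing frequency of `stack / (2 * ante + stack)` and a calling frequency of `2 * ante / (2 * ante + stack)`, that is 0.6 and 0.4 for the defaults, and the averaged CFR strategies should approach those values as the iteration count grows.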