Hierarchical Deep Counterfactual Regret Minimization

Jiayu Chen; Zhekai Wang; Vaneet Aggarwal

TL;DR

This work integrates skill-based hierarchies with Counterfactual Regret Minimization (CFR) for learning in imperfect-information games (IIGs). It introduces Hierarchical CFR (HCFR) in a tabular setting, adds a low-variance Monte Carlo extension, and then scales to large tasks with deep game trees via Hierarchical Deep CFR (HDCFR), which uses neural networks to approximate regrets, strategies, and baselines. Theoretical guarantees include convergence to Nash Equilibria under HCFR and unbiased, variance-reduced regret estimators in the Monte Carlo extension (MCCFR); practical benefits include skill transfer and the injection of human expertise. Empirically, HDCFR outperforms state-of-the-art model-free baselines on long-horizon two-player zero-sum IIGs (e.g., Leduc, FHP), and ablations confirm that the hierarchical structure, variance reduction, and sampling design are each needed for robust performance.

Abstract

Imperfect Information Games (IIGs) offer robust models for scenarios where decision-makers face uncertainty or lack complete information. Counterfactual Regret Minimization (CFR) has been one of the most successful families of algorithms for tackling IIGs. Integrating skill-based strategy learning with CFR could mirror a more human-like decision-making process and enhance learning performance in complex IIGs. It enables the learning of a hierarchical strategy, wherein low-level components represent skills for solving subgames and the high-level component manages the transition between skills. In this paper, we introduce the first hierarchical version of Deep CFR (HDCFR), an innovative method that boosts learning efficiency in tasks involving extensively large state spaces and deep game trees. A notable advantage of HDCFR over previous works is its ability to facilitate learning with predefined (human) expertise and foster the acquisition of skills that can be transferred to similar tasks. To achieve this, we first construct our algorithm in a tabular setting, encompassing hierarchical CFR updating rules and a variance-reduced Monte Carlo sampling extension. Notably, we offer theoretical justifications, including the convergence rate of the proposed updating rule, the unbiasedness of the Monte Carlo regret estimator, and ideal criteria for effective variance reduction. Then, we employ neural networks as function approximators and develop deep learning objectives to adapt our proposed algorithms for large-scale tasks, while maintaining the theoretical support.
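The hierarchical strategy described above can be pictured with a minimal sketch. It assumes a one-step option model at a single information set: the high-level component picks the next skill conditioned on the previous skill, and the chosen skill then picks an action. The function names, table shapes, and probabilities below are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def joint_probability(high, low, prev_option):
    """P(option, action | prev_option) as a (num_options, num_actions) table.

    high[z_prev, z] : hypothetical high-level probability of switching to skill z
    low[z, a]       : hypothetical low-level probability of action a under skill z
    """
    return high[prev_option][:, None] * low

def sample_step(high, low, prev_option, rng):
    """Sample a skill from the high level, then an action from that skill."""
    option = rng.choice(low.shape[0], p=high[prev_option])
    action = rng.choice(low.shape[1], p=low[option])
    return int(option), int(action)

# Toy numbers (assumed): 2 skills, 3 actions at one information set.
high = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
low = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])

rng = np.random.default_rng(0)
joint = joint_probability(high, low, prev_option=0)
option, action = sample_step(high, low, prev_option=0, rng=rng)
```

Because the joint table is a product of two probability distributions, it sums to one, so the pair (skill, action) can be treated as a single composite decision when regrets are accumulated.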

Paper Structure

This paper contains 24 sections, 22 theorems, 79 equations, 3 figures, 3 tables, 2 algorithms.

Introduction
Background
Counterfactual Regret Minimization
The Option Framework
Methodology
Preliminaries
Hierarchical Counterfactual Regret Minimization
Low-Variance Monte Carlo Sampling Extension
Hierarchical Deep Counterfactual Regret Minimization
Evaluation and Main Results
Comparison with State-of-the-Art Model-free Algorithms for Zero-sum IIGs
Ablation Analysis
Case Study: Delving into the Learned Hierarchical Strategy
Related Work
Conclusion
...and 9 more sections

Key Result

Theorem 1

In a two-player zero-sum game at time $T$, if both players' average overall regret is less than $\epsilon$, then $\overline{\sigma}^T=\{\overline{\sigma}^T_1, \overline{\sigma}^T_2\}$ is a $2\epsilon$-Nash Equilibrium.
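Theorem 1 can be checked numerically on a small matrix game. The sketch below is not the paper's algorithm: it runs plain regret matching in self-play on a hypothetical 3x3 zero-sum payoff matrix, measures each player's average overall regret, and compares it with how much each player could gain by deviating from the average strategy profile. The matrix, function names, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

def regret_matching(cum_regret):
    """Play in proportion to positive cumulative regret (uniform if none)."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total > 0:
        return pos / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

def self_play(A, T):
    """Regret-matching self-play; A holds the row player's payoffs."""
    n, m = A.shape
    R1, R2 = np.zeros(n), np.zeros(m)        # cumulative regrets
    sum_x, sum_y = np.zeros(n), np.zeros(m)  # strategy sums for averaging
    for _ in range(T):
        x, y = regret_matching(R1), regret_matching(R2)
        u1 = A @ y        # row player's payoff for each pure action
        u2 = -(x @ A)     # column player's payoff for each pure action
        v = x @ u1        # row player's expected payoff this round
        R1 += u1 - v
        R2 += u2 + v      # column player's expected payoff is -v
        sum_x += x
        sum_y += y
    x_bar, y_bar = sum_x / T, sum_y / T
    eps = max(R1.max(), R2.max()) / T        # larger average overall regret
    value = x_bar @ A @ y_bar
    gain1 = (A @ y_bar).max() - value        # row player's best deviation gain
    gain2 = value - (x_bar @ A).min()        # column player's best deviation gain
    return x_bar, y_bar, eps, gain1, gain2

# Hypothetical biased rock-paper-scissors payoffs (assumed for illustration).
A = np.array([[0.0, -1.0, 2.0],
              [1.0, 0.0, -1.0],
              [-2.0, 1.0, 0.0]])
x_bar, y_bar, eps, gain1, gain2 = self_play(A, T=10000)
```

In a matrix game the two deviation gains add up exactly to the sum of the two players' average regrets, so neither gain can exceed $2\epsilon$ when $\epsilon$ is taken as the larger of the two average regrets. This is the inequality the theorem states for the average strategy profile $\overline{\sigma}^T$.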

Figures (3)

Figure 1: Performance comparison on Leduc poker games. Lower exploitability indicates a closer approximation to the Nash Equilibrium. While HDCFR matches baseline performance in simpler scenarios, it exhibits superior convergence performance as the game's decision horizon increases.
Figure 2: Learning process of different ablations on Leduc_20. (a) Without the MHA component in the high-level strategy (NO_MHA) or the baseline function for variance reduction (NO_BASELINE), convergence performance degrades significantly. Following the CFR rule (Equation (\ref{equ:6})) results in slightly slower convergence. (b) Increased randomness in the traverser's sample strategy enhances learning. (c) More sampled trajectories in each training episode boost initial convergence speed without affecting final performance.
Figure 3: Learning performance on Leduc_20 with transferred skills from other Leduc tasks. The transferred skills can either be fixed or not when learning a hierarchical strategy on the new scenario. The learning performance without transferred skills (labelled as HDCFR) is provided as reference. By preserving pre-learned skills, the agent focuses on mastering a high-level strategy, thus accelerating learning. However, by adjusting these skills in tandem with the high-level strategy, enhanced results are possible, as evident when using Leduc_15 skills, which peaked around episode 400.

Theorems & Definitions (22)

Theorem 1
Theorem 2
Theorem 3
Proposition 1
Theorem 4
Theorem 5
Proposition 2
Proposition 3
Proposition 4
Lemma 1
...and 12 more
