Learning to Play Blackjack: A Curriculum Learning Perspective

Amirreza Alasti, Efe Erdal, Yücel Celik, Theresa Eimer

Abstract

Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning agent and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent's average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent's full training completing faster than the baseline's evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.

Paper Structure

This paper contains 42 sections, 13 figures, and 1 table.

Figures (13)

  • Figure 1: The adaptive, LLM-guided curriculum learning framework. (1) Generation & Adaptation: An LLM generates an initial curriculum stage. After a training phase, the agent's performance summary is fed back to the LLM, which adapts the curriculum by deciding whether to advance the agent to the next stage. (2) Training Loop: The agent's training is constrained by the current curriculum stage, which masks unavailable actions in the environment to focus exploration. (A minimal illustrative sketch of this masking step follows the figure list.)
  • Figure 2: A statistical summary of agent performance over 10 independent runs in the 8-deck curriculum environment. The results highlight the DQN agent's significantly higher peak performance distribution (left) and confirm that this peak is most frequently achieved at Stage 4 (right). The high stage completion rate for both agents (middle) indicates the curriculum was well-paced and successfully navigated in most runs.
  • Figure 3: Win rate progression for baseline DQN and Tabular agents trained without a curriculum. The DQN agent's performance is highly volatile, illustrating the instability and exploration challenges that arise when the full action space is introduced at once.
  • Figure 4: Win rate progression for DQN and Tabular agents trained with the LLM-guided curriculum. The curriculum provides a stable learning trajectory for the DQN agent, preventing the performance volatility seen in baseline training and allowing it to converge to a high and consistent win rate.
  • Figure 5: Deeper analysis of agent performance, showing a positive correlation between the number of curriculum stages completed and the best achieved win rate for the DQN agent (left). However, the agent's final performance is often lower than its peak (middle), reinforcing the finding, quantified in the summary (right), that the optimal policy is achieved at an intermediate stage.
  • ...and 8 more figures
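
To make the masking step described in the Figure 1 caption concrete, the short Python sketch below shows how a curriculum stage could restrict an agent's action choice during training. It is an illustration, not the authors' implementation: the action names, the per-stage action sets, and the fixed win-rate threshold standing in for the LLM's advancement decision are all assumptions made for this example.

    import random

    # Hypothetical full action set for an expanded Blackjack environment;
    # the paper's exact action names are not listed in this excerpt.
    ACTIONS = ["stick", "hit", "double", "split", "surrender"]

    # Illustrative curriculum: each stage unlocks one more action. In the
    # framework these stages are proposed and adapted by an LLM; here they
    # are hard-coded purely for demonstration.
    CURRICULUM = [
        {"stick", "hit"},
        {"stick", "hit", "double"},
        {"stick", "hit", "double", "split"},
        {"stick", "hit", "double", "split", "surrender"},
    ]

    def masked_epsilon_greedy(q_values, allowed, epsilon=0.1):
        """Epsilon-greedy action selection restricted to the current stage's actions."""
        candidates = [a for a in ACTIONS if a in allowed]
        if random.random() < epsilon:
            return random.choice(candidates)
        return max(candidates, key=lambda a: q_values.get(a, 0.0))

    def should_advance(summary, win_rate_threshold=0.42):
        """Stand-in for the LLM adaptation step: advance when performance is good enough."""
        return summary["win_rate"] >= win_rate_threshold

    # One step of a toy training loop consulting the curriculum.
    stage = 0
    q_values = {"stick": 0.1, "hit": 0.3, "double": 0.5}  # toy Q-values for a single state
    action = masked_epsilon_greedy(q_values, CURRICULUM[stage])
    print(f"Stage {stage}: allowed={sorted(CURRICULUM[stage])}, chose '{action}'")

    if should_advance({"win_rate": 0.44}) and stage + 1 < len(CURRICULUM):
        stage += 1  # unlock the next stage of the curriculum

In the actual framework, the advancement decision is made by prompting an LLM with the agent's performance summary, and the masking is applied in the environment itself so that exploration never encounters actions outside the current stage.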