Solving Rubik's Cube Without Tricky Sampling

Yicheng Lin, Siyu Liang

TL;DR

The paper tackles sparse-reward reinforcement learning for the Rubik’s Cube by learning directly from fully scrambled states, avoiding near-solved-state sampling and search. It introduces a policy-gradient framework centered on ChaseNet, a neural predictor of state-pair costs, integrated into NX, Env, and Actor modules to guide learning from disordered configurations. On the 2x2x2 cube, the approach achieves over 99.4% success across 50,000 scrambled trials without tree search, demonstrating strong performance in a challenging sparse-reward setting. These results suggest a promising direction for generalized sparse-reward problems and motivate scaling to larger puzzles and diverse domains.

Abstract

The Rubik's Cube, with its vast state space and sparse reward structure, presents a significant challenge for reinforcement learning (RL) due to the difficulty of reaching rewarded states. Previous research addressed this by propagating cost-to-go estimates from the solved state and incorporating search techniques. These approaches differ from human strategies, which start from fully scrambled cubes, and they can be difficult to apply to a general sparse-reward problem. In this paper, we introduce a novel RL algorithm using policy gradient methods to solve the Rubik's Cube without relying on near-solved-state sampling. Our approach employs a neural network to predict cost patterns between states, allowing the agent to learn directly from scrambled states. Our method was tested on the 2x2x2 Rubik's Cube, where the cube was scrambled 50,000 times, and the model successfully solved it in over 99.4% of cases. Notably, this result was achieved using only the policy network, without relying on tree search as in previous methods, demonstrating its effectiveness and potential for broader applications in sparse-reward problems.

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 2 algorithms.

Figures (3)

  • Figure 2.1: General representation of the Rubik's Cube problem. a. The state of the cube can be represented by a vector, with each sticker encoded as a number. b. Scrambles are modeled using a permutation matrix applied to the state vector. c. The cube’s topological structure is visualized as a state graph, where nodes represent states and edges represent transitions.
  • Figure 2.2: The NX Module. a. The NX Module training process is divided into a warmup phase and a training phase. b. Two architectural variants of ChaseNet: ChaseNet-FC (fully connected) and ChaseNet-Attention (attention-based).
  • Figure 3.1: a. Warmup loss for ChaseNet-FC and ChaseNet-Attention. b. Average rewards during RL training using rewards from ChaseNet-FC and ChaseNet-Attention. c. Success rate during RL training using rewards from ChaseNet-FC and ChaseNet-Attention.
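
The representation described in the Figure 2.1 caption (a sticker vector acted on by permutation matrices) can be illustrated with a minimal sketch. The sticker indexing, the colour encoding, and the names `SOLVED`, `U_CYCLES`, `cycles_to_matrix`, and `apply_move` are assumptions made for this example and are not taken from the paper, which may use a different layout.

```python
# Hypothetical sticker layout for a 2x2x2 cube (an assumption, not the paper's):
# faces U, F, R, B, L, D occupy indices 0-3, 4-7, 8-11, 12-15, 16-19, 20-23,
# and each sticker is encoded as the number of its face colour (0..5).
N_STICKERS = 24
SOLVED = [i // 4 for i in range(N_STICKERS)]

# One quarter turn of the U face under this layout, written as 4-cycles
# "a -> b -> c -> d -> a": the sticker at position a moves to position b, etc.
# The first cycle rotates the U face itself; the other two carry the top-row
# stickers of the four side faces around the cube.
U_CYCLES = [(0, 1, 3, 2), (4, 16, 12, 8), (5, 17, 13, 9)]


def cycles_to_matrix(cycles, n=N_STICKERS):
    """Build the n x n permutation matrix P with P[dst][src] = 1."""
    dest = list(range(n))  # dest[src] = position the sticker at src moves to
    for cycle in cycles:
        for k, src in enumerate(cycle):
            dest[src] = cycle[(k + 1) % len(cycle)]
    matrix = [[0] * n for _ in range(n)]
    for src, dst in enumerate(dest):
        matrix[dst][src] = 1
    return matrix


def apply_move(matrix, state):
    """Return the matrix-vector product P @ state as a new state vector."""
    return [sum(p * s for p, s in zip(row, state)) for row in matrix]


U = cycles_to_matrix(U_CYCLES)

# Applying the move once changes the state; applying it four times in a row
# brings the cube back to where it started.
state = apply_move(U, SOLVED)
restored = SOLVED
for _ in range(4):
    restored = apply_move(U, restored)
```

In the state graph of panel c, each such matrix corresponds to one edge type: `state` is the neighbour of `SOLVED` reached by this move. A practical implementation would more likely store each move as an index array rather than a dense matrix; the matrix form is used here only to mirror the caption.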