Diffusion-Based Offline RL for Improved Decision-Making in Augmented ARC Task

Yunho Kim, Jaehyun Park, Heejun Kim, Sejin Kim, Byung-Jun Lee, Sundong Kim

TL;DR

An augmented offline RL dataset for ARC is introduced, called Synthesized Offline Learning Data for Abstraction and Reasoning (SOLAR), along with the SOLAR-Generator, which generates diverse trajectory data based on predefined rules.

Abstract

Effective long-term strategies enable AI systems to navigate complex environments by making sequential decisions over extended horizons. Similarly, reinforcement learning (RL) agents optimize decisions across sequences to maximize rewards, even without immediate feedback. To verify that Latent Diffusion-Constrained Q-learning (LDCQ), a prominent diffusion-based offline RL method, demonstrates strong reasoning abilities in multi-step decision-making, we aimed to evaluate its performance on the Abstraction and Reasoning Corpus (ARC). However, applying offline RL methodologies to enhance strategic reasoning in AI for solving tasks in ARC is challenging due to the lack of sufficient experience data in the ARC training set. To address this limitation, we introduce an augmented offline RL dataset for ARC, called Synthesized Offline Learning Data for Abstraction and Reasoning (SOLAR), along with the SOLAR-Generator, which generates diverse trajectory data based on predefined rules. SOLAR enables the application of offline RL methods by offering sufficient experience data. We synthesized SOLAR for a simple task and used it to train an agent with the LDCQ method. Our experiments demonstrate the effectiveness of the offline RL approach on a simple ARC task, showing the agent's ability to make multi-step sequential decisions and correctly identify answer states. These results highlight the potential of the offline RL approach to enhance AI's strategic reasoning capabilities.

Paper Structure

This paper contains 39 sections, 3 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Three tasks in ARC. Each task consists of demonstration examples and a test example. Each example has an input grid and an output answer grid. Each pixel in the grid is matched to a color corresponding to a value in the range 0–9. ARC requires identifying common rules from the demonstration examples and applying them to solve the test example correctly. Despite recent advancements in AI, current models have consistently underperformed compared to humans on the ARC benchmark [arcprize2024, johnson2021fast].
  • Figure 2: An example of a single step in ARCLE. In this example step, the action has an operation 30 (Paste) and a selection of $[3, 0, 2, 2]$. The top-left coordinate of the selection box is $[3,0]$ and the bottom-right coordinate is $[5,2]$. $[h_t,w_t]$ is calculated by subtracting $[3,0]$ from $[5,2]$. When ARCLE executes this action, the current clipboard is pasted into the bounding box specified by the selection on the current grid. It then returns episode information, including the reward and termination status.
  • Figure 3: (a)--(c) Training stages of LDCQ. (a) Training a $\beta$-VAE whose encoder maps $H$-horizon trajectory segments to latents $\bm{z}_t$, and whose policy decoder predicts actions from $\bm{z}_t$ and the state $\bm{s}_{t+h}$, where $h \in [0, H)$ indexes a step within the segment encoded by the latent. (b) Training a diffusion model based on $\bm{z}_t$ and $\bm{s}_t$. (c) Training a Q-network using latents sampled through the diffusion model. (d) LDCQ inference step at $\bm{s}_{t+h}$. Candidate latents at $\bm{s}_t$ are sampled through the diffusion model, and the agent executes the actions decoded from the latent with the highest Q-value.
  • Figure 4: Data synthesis procedure with SOLAR-Generator. The states and actions follow the format described in Section \ref{Sec:ARCLE}. 1) Loading Synthesized Data: The Grid Maker module applies constraints, augments input-output pairs, and synthesizes solutions for specific tasks by utilizing actions. 2) Validating Trajectories: Checks whether the generated actions are executable in ARCLE. 3) Structuring SOLAR: Organizes and stores the synthesized data in SOLAR based on the defined format. This step determines what information to include in the dataset and whether to segment episodes into fixed-length chunks or store them as a whole.
  • Figure 5: SOLAR episodes for a simple task: The gold standard trajectory (episode) contains the steps to solve the problem by using the core knowledge priors properly. The non-optimal episodes branch off at a random step within the standard trajectory, performing random operations such as Rotate, Flip, or Copy & Paste, and then Submit after a certain number of steps.
  • ...and 6 more figures
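
The selection arithmetic in the Figure 2 caption can be checked with a few lines of Python. This is a minimal sketch rather than ARCLE's actual API: the function names `selection_to_corners` and `corners_to_selection` are hypothetical, and the `[x, y, h, w]` field order is an assumption read off the caption's example, where the selection $[3, 0, 2, 2]$ has top-left corner $[3,0]$ and bottom-right corner $[5,2]$.

```python
def selection_to_corners(selection):
    """Convert a selection [x, y, h, w] into (top_left, bottom_right).

    Assumption: the bottom-right corner equals the top-left corner plus
    [h, w], matching the caption's example ([3, 0] + [2, 2] = [5, 2]).
    """
    x, y, h, w = selection
    top_left = (x, y)
    bottom_right = (x + h, y + w)
    return top_left, bottom_right


def corners_to_selection(top_left, bottom_right):
    """Inverse mapping: recover [x, y, h, w] by subtracting the corners."""
    (x0, y0), (x1, y1) = top_left, bottom_right
    return [x0, y0, x1 - x0, y1 - y0]


# The example from the Figure 2 caption.
top_left, bottom_right = selection_to_corners([3, 0, 2, 2])
print(top_left, bottom_right)
```

Storing the size as an offset from the top-left corner keeps a selection to four integers, and the round trip through `corners_to_selection` recovers the original values.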
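
The inference step described in Figure 3(d) (sample candidate latents, keep the one with the highest Q-value, decode it into actions) can be sketched as follows. Everything here is illustrative: `select_latent`, `toy_q`, and `toy_decoder` are hypothetical stand-ins, whereas in LDCQ the sampler, Q-function, and decoder are trained networks, and the scalar latents and candidate count are assumptions made only to keep the example small.

```python
import random


def select_latent(state, sample_latents, q_value, num_samples=16):
    """Pick the candidate latent with the highest Q-value at this state.

    sample_latents(state, n) stands in for the diffusion model and
    q_value(state, z) stands in for the Q-network (both hypothetical).
    """
    candidates = sample_latents(state, num_samples)
    return max(candidates, key=lambda z: q_value(state, z))


def toy_q(state, z):
    """Toy Q-function that prefers latents close to 0.5."""
    return -(z - 0.5) ** 2


def toy_decoder(state, z):
    """Toy policy decoder mapping a latent to an operation name."""
    return "Paste" if z > 0 else "Rotate"


# Toy sampler: uniform random scalars in place of diffusion samples.
rng = random.Random(0)
sampler = lambda state, n: [rng.uniform(-1.0, 1.0) for _ in range(n)]

best = select_latent(state=None, sample_latents=sampler, q_value=toy_q)
action = toy_decoder(None, best)
print(best, action)
```

Restricting the argmax to sampled latents means the agent only considers behaviors the sampler proposes, which is how the caption describes the choice being made among "possible latents" rather than over all latents.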