Counting Reward Automata: Sample Efficient Reinforcement Learning Through the Exploitation of Reward Function Structure

Tristan Bester; Benjamin Rosman; Steven James; Geraud Nangue Tasse

TL;DR

This work introduces Counting Reward Automata (CRAs), finite-state machines augmented with counters that can model reward functions described by unrestricted grammars, extending beyond Reward Machines (RMs) to a wider class of temporally extended tasks. Embedding a CRA into an Automaton-Augmented MDP (AAMDP) lets standard reinforcement learning algorithms operate on augmented states, while the automaton structure is exploited for sample-efficient learning through counterfactual experiences (Counterfactual Q-Learning). The authors show that CRAs can express complex reward structures directly, including context-free languages (CFLs) and context-sensitive languages, yield simpler and more scalable state machines than RM-based methods, and can be specified from natural language using large language models. Together, this expressivity, the sample-efficiency gains, and natural-language task specification make CRAs well suited to long-horizon reinforcement learning, including neuro-symbolic settings.
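The idea of reusing one environment transition across many machine configurations can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the machine `cra_step`, its state names, the single counter, and the caller-supplied list of configurations to replay are all assumptions made for this example. The task is a toy one ("observe A n times, then B, then C n times").

```python
from collections import defaultdict


def cra_step(config, symbol):
    """Hypothetical one-counter machine: returns (next_config, reward, done).

    A configuration is a pair (machine_state, counter_value).
    """
    state, count = config
    if state == "count_A" and symbol == "A":
        return ("count_A", count + 1), 0.0, False
    if state == "count_A" and symbol == "B" and count > 0:
        return ("count_C", count), 0.0, False
    if state == "count_C" and symbol == "C" and count > 0:
        if count == 1:  # last required C observed: task complete
            return ("done", 0), 1.0, True
        return ("count_C", count - 1), 0.0, False
    return config, 0.0, False  # no matching transition: stay in place


def counterfactual_update(q, configs, s, a, s_next, symbol, actions,
                          alpha=0.1, gamma=0.9):
    """Apply one Q-learning update per configuration in `configs`,
    reusing a single environment transition (s, a, s_next, symbol)."""
    for config in configs:
        next_config, reward, done = cra_step(config, symbol)
        best_next = 0.0 if done else max(
            q[(s_next, next_config, b)] for b in actions
        )
        key = (s, config, a)  # augmented state = (env state, machine config)
        q[key] += alpha * (reward + gamma * best_next - q[key])


q = defaultdict(float)
counterfactual_update(
    q, [("count_A", 0), ("count_C", 1)],
    s="s0", a="right", s_next="s1", symbol="C", actions=["left", "right"],
)
```

Because counters are unbounded, the set of configurations replayed per step has to be finite. Which configurations to include is a design choice that this sketch leaves to the caller.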

Abstract

We present counting reward automata, a finite state machine variant capable of modelling any reward function expressible as a formal language. Unlike previous approaches, which are limited to the expression of tasks as regular languages, our framework allows for tasks described by unrestricted grammars. We prove that an agent equipped with such an abstract machine is able to solve a larger set of tasks than those utilising current approaches. We show that this increase in expressive power does not come at the cost of increased automaton complexity. A selection of learning algorithms is presented which exploit automaton structure to improve sample efficiency. We show that the state machines required in our formulation can be specified from natural language task descriptions using large language models. Empirical results demonstrate that our method outperforms competing approaches in terms of sample efficiency, automaton complexity, and task completion.
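To make the expressivity claim concrete, the sketch below checks membership in the context-sensitive language a^n b^n c^n (n >= 1) using three control states and two counters. No finite automaton with a fixed number of states can do this for unbounded n. This is a generic counter-machine illustration and not the paper's formal CRA definition. The function name, state labels, and counter layout are assumptions for this example.

```python
def accepts_anbncn(word):
    """Two-counter check for a^n b^n c^n (n >= 1) with three control states."""
    state, c1, c2 = "A", 0, 0
    for ch in word:
        if state == "A" and ch == "a":
            c1 += 1                      # count each 'a'
        elif state in ("A", "B") and ch == "b" and c1 > 0:
            state = "B"
            c1 -= 1                      # match one 'b' against one 'a'
            c2 += 1                      # and remember it for the 'c' phase
        elif state in ("B", "C") and ch == "c" and c1 == 0 and c2 > 0:
            state = "C"
            c2 -= 1                      # match one 'c' against one 'b'
        else:
            return False                 # symbol out of order or count mismatch
    return state == "C" and c1 == 0 and c2 == 0


print(accepts_anbncn("aabbcc"))
print(accepts_anbncn("aabbc"))
```

The number of control states stays fixed at three regardless of n. The counters absorb the growth that a purely finite-state machine would have to encode as additional states.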

Paper Structure (32 sections, 3 theorems, 13 equations, 11 figures, 2 algorithms)

Introduction
Background
MDPs, NMDPs, and RDPs
Reward Machines
Counting Reward Automata
Counter Machines
Augmenting Agents with Counter Machines
Counting Reward Automaton
Example Task
CRA Operation
Solving the Example Task
Compatible Reward Functions
Relationship between Counting Reward Automata and Reward Machines
Learning Algorithms
The AAMDP Baseline
...and 17 more sections

Key Result

Theorem 1

For each constant counting reward automaton, there exists an equivalent counting reward automaton.

Figures (11)

Figure 1: Illustration of the $LetterEnv$ environment, configured for the CFL experiment. The symbol A is replaced with a B after it has been observed $N$ times by the agent.
Figure 2: Illustration of a CCRA used to solve the example CFL task in the $LetterEnv$ environment. The $\tau$ symbol is used to represent a tautology (a propositional formula that is always true) which conditions the corresponding transition only on the states of the counters.
Figure 3: Number of samples required by each approach to obtain a solution to the task. Mean and variance are reported over 40 independent trials. A single CRA can be trained and immediately used to solve all tasks in the specification. For all other approaches, multiple policies must be learned.
Figure 4: Illustration of the Office Gridworld presented in Toro Icarte et al. (2022). The agent, represented as a blue circle, begins in a fixed location. The agent is able to move in any of the four cardinal directions and its observations are restricted to its current position in the environment. The symbols $\ast$ represent decorations, which are broken if the agent collides with them. Mail can be collected from the location ✉ and coffee can be made at the coffee location. A number of people are located at $P$. The trajectory for the context-sensitive task specification is shown.
Figure 5: Comparison between the complexity of the state machines produced by the CRA and RM formulations. The illustration shows the complexity of the machine required to solve a task specification with a fixed upper bound on task-string length (in this case, the maximum number of mail items). RMs were constructed using a general template implementation parameterised by the maximum string length required for the task.
...and 6 more figures

Theorems & Definitions (12)

Definition 1.1: Counting Reward Automaton
Remark
Definition 1.2: Constant Counting Reward Automaton
Theorem 1
proof
Definition 1.3: Automaton-Augmented Markov Decision Process
Theorem 2
proof
Theorem 3
proof
...and 2 more
