Reinforcement Learning with Knowledge Representation and Reasoning: A Brief Survey

Chao Yu; Shicheng Ye; Hankz Hankui Zhuo

TL;DR

Reinforcement learning (RL) often suffers from limited sample efficiency, poor generalization, and gaps in safety and interpretability. The paper surveys how Knowledge Representation and Reasoning (KRR) methods, such as Reward Machines (RMs) for non-Markovian rewards, temporal-logic automata, Answer Set Programming, Markov Logic Networks, and planning formalisms, can be integrated with RL to address these problems. It groups the literature into three strands (efficiency, generalization, safety/interpretability) and details representative approaches such as QRM/HRM, PEORL/SDRL, temporal logics (TLs) including MITL, RMs for transfer, and safety monitors, while highlighting open problems such as extending beyond LTL and integrating LLMs. Overall, the survey points to a future in which agents combine high-level reasoning with exploratory RL to achieve scalable, verifiable, and adaptable behavior across complex tasks.

Abstract

Reinforcement Learning (RL) has developed rapidly in recent years, but it still faces significant obstacles in addressing complex real-life problems because of poor generalization, low sample efficiency, and safety and interpretability concerns. The core reason underlying these difficulties is that most work has focused on the computational aspect of value functions or policies, using a representational model to describe atomic components such as rewards, states, and actions, and has thus neglected the rich high-level declarative domain knowledge of facts, relations, and rules that can be either provided a priori or acquired through reasoning over time. Recently, there has been rapidly growing interest in the use of Knowledge Representation and Reasoning (KRR) methods, usually based on logical languages, to enable more abstract representation and more efficient learning in RL. In this survey, we provide a preliminary overview of these endeavors that leverage the strengths of KRR to help solve various problems in RL, and we discuss the challenging open problems and possible directions for future work in this area.

Paper Structure (19 sections, 3 equations, 9 figures, 3 tables)

Introduction
Background
RL
KRR
KRR for Efficiency of RL
Task Representation
Symbolic Planning
KRR for Generalization of RL
Transfer Learning
Continuous/Lifelong Learning
KRR for Safety/Interpretability of RL
Interpretability
Safety
Challenges
Increasing the Spectrum of KRR/RL Methods
...and 4 more sections

Figures (9)

Figure 1: Key components and concepts in RL and KRR.
Figure 2: An illustration of representing a task as an RM and a one-step update of the QRM algorithm. The task of "get mail and coffee" can be modeled using an RM with four abstract states $u_0$, $u_1$, $u_2$, and $u_3$. Starting at $u_0$ (the initial stage of the task), when the agent gets coffee, the RM receives this abstracted description, transitions to $u_2$ (the stage in which the coffee has been obtained), and returns a reward $r=0$. Reaching the accepting state $u_3$ indicates that the task has been completed.
Figure 3: An illustration of handling a multi-agent cooperative task using RMs (adapted from neary2020reward). In the scenario, three agents must work collaboratively to guide agent $A_1$ to the target location $Goal$. The colored zones indicate regions where the button of the corresponding color must be activated for an agent to pass. For the red zone, agents $A_2$ and $A_3$ must press the red button simultaneously to allow agent $A_1$ to pass, while the yellow and green zones require only one agent to press the respective button. This multi-agent cooperative task can be represented using the corresponding RM, whose set of events is $\Sigma = \{Y_B, G_B, R_B, A_2^{R_B}, A_2^{\neg R_B}, A_3^{R_B}, A_3^{\neg R_B}, Goal\}$. The RM representing the entire task can be decomposed into RMs for each agent's subtask by projecting it onto the local event set of each agent.
Figure 4: An illustration of the PRM and SRM. The original task is to bring coffee to the office, with a reward of 1 upon successful completion. To turn this into a scenario with random rewards, the coffee machine is given a 10% chance of malfunctioning and producing substandard coffee, in which case no reward is obtained. This task specification can be used to construct the corresponding PRM and SRM. In the PRM, a nondeterministic transition function provides the randomness, where $y \overset{l|p}{\underset{r}{\rightarrow}} y'$ indicates that $y$ transitions to $y'$ on high-level label $l$ with probability $p$, receiving reward $r$. In the SRM, a stochastic reward function is used, where $y \overset{l}{\underset{U}{\rightarrow}} y'$ indicates that $y$ transitions to $y'$ on high-level label $l$, receiving a reward sampled from the uniform distribution $U$.
Figure 5: An illustration of (a) automatic reward function generation based on TLTL; (b) converting a task specified in $\omega$-regular LTL into an LDBA; and (c) a possible myopic case that may arise when using an LDBA (adapted from voloshin2023eventual). In (a), the task is defined in TLTL with quantitative semantics, enabling the automatic generation of a reward function that guides the agent. In (b), the task is achieved when the accepting states of the LDBA are visited infinitely often; the only accepting state is $u_1$. In (c), an agent starts in state $u_0$ and has only two actions, $A$ and $B$. Taking action $A$ transitions directly to an accepting state, from which the agent visits the accepting state every two steps. In contrast, action $B$ transitions to an accepting state with probability $\alpha$ and to a sink state with probability $1-\alpha$. When $\alpha$ exceeds a certain threshold, the agent may take the risk of choosing action $B$, leading to myopic behavior.
...and 4 more figures
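
The RM and QRM update described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the survey's implementation: only the $u_0 \rightarrow u_2$ transition on coffee (reward 0) and the accepting state $u_3$ come from the caption, while the remaining edges, the final reward of 1, the label names, the toy state and action names, and the helpers `rm_step` and `qrm_update` are hypothetical. The update follows the usual QRM idea that a single environment step is used to update the Q-function of every RM state.

```python
from collections import defaultdict

# Reward machine for "get mail and coffee".
# Only the u0 --coffee--> u2 edge (reward 0) and the accepting state u3 come
# from the caption; the other edges and the final reward of 1 are assumed.
TRANSITIONS = {
    ("u0", "mail"): ("u1", 0.0),
    ("u0", "coffee"): ("u2", 0.0),
    ("u1", "coffee"): ("u3", 1.0),
    ("u2", "mail"): ("u3", 1.0),
}
RM_STATES = ("u0", "u1", "u2", "u3")
ACCEPTING = {"u3"}


def rm_step(u, label):
    """Return (next RM state, reward); unlisted labels self-loop with reward 0."""
    return TRANSITIONS.get((u, label), (u, 0.0))


def qrm_update(q, s, a, s_next, label, actions, alpha=0.5, gamma=0.9):
    """One QRM-style update: the same experience updates Q_u for every u."""
    for u in RM_STATES:
        if u in ACCEPTING:
            continue  # nothing to learn once the task is complete
        u_next, r = rm_step(u, label)
        if u_next in ACCEPTING:
            target = r  # terminal RM state: no bootstrapping
        else:
            target = r + gamma * max(q[(u_next, s_next, b)] for b in actions)
        q[(u, s, a)] += alpha * (target - q[(u, s, a)])


# Q-values keyed by (RM state, environment state, action), initialised to 0.
q = defaultdict(float)
actions = ("left", "right")

# One environment step in which the agent picks up coffee.
qrm_update(q, s=0, a="right", s_next=1, label="coffee", actions=actions)
```

In this toy step the label `coffee` leaves the value for `u0` unchanged (reward 0 and zero-initialised successors), but it already raises the value for `u1`, because from `u1` the same event would complete the task. Reusing one experience across RM states in this way is where the sample-efficiency benefit is expected to come from.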
