
Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach

Daniele Meli, Alberto Castellini, Alessandro Farinelli

TL;DR

This work tackles the challenge of guiding online POMDP planning with domain-relevant, interpretable heuristics learned from execution traces. It leverages Inductive Logic Programming via ILASP to produce Answer Set Programs that map belief-based features to actions, enabling soft policy guidance integrated into POMCP and DESPOT (and extensions like AdaOPS). The approach yields high-quality, human-interpretable policy specifications from relatively few traces, outperforming neural policy learners in data efficiency and offering generalization to larger and more complex problems such as rocksample and pocman. Empirical results show improved planning performance and reduced computational effort, with robust behavior under partial or missing feature information and the ability to adapt across solvers. The work demonstrates the value of combining symbolic reasoning with online planning for scalable, interpretable POMDP solutions and suggests avenues for temporal extensions and online learning in future work.

Abstract

Partially Observable Markov Decision Processes (POMDPs) are a powerful framework for planning under uncertainty. They model state uncertainty as a belief, i.e., a probability distribution over states. Approximate solvers based on Monte Carlo sampling have been very successful at reducing the computational demand and enabling online planning. However, scaling to complex realistic domains with many actions and long planning horizons is still a major challenge. A key to good performance is guiding the action-selection process with policy heuristics tailored to the specific application domain. We propose to learn high-quality heuristics from traces of POMDP executions generated by any solver. We convert the belief-action pairs to a logical semantics, and exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications, which are then used as online heuristics. We thoroughly evaluate our methodology on two notoriously challenging POMDP problems involving large action spaces and long planning horizons, namely rocksample and pocman. Considering different state-of-the-art online POMDP solvers, including POMCP, DESPOT and AdaOPS, we show that learned heuristics expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, with lower computational time. Moreover, they generalize well to more challenging scenarios not experienced in the training phase (e.g., more rocks and larger grids in rocksample, larger maps and more aggressive ghosts in pocman).
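
To make the step of converting belief-action pairs to a logical semantics more concrete, the following is a minimal sketch for a rocksample-like setting. It is an illustration under stated assumptions, not the authors' implementation: the predicate names `guess` and `dist`, the particle layout, and the 10% discretization step are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical particle layout: (agent position, {rock id: is the rock valuable?}).
Particle = Tuple[Tuple[int, int], Dict[int, bool]]


def belief_to_atoms(
    particles: List[Particle],
    rock_positions: Dict[int, Tuple[int, int]],
    step: int = 10,
) -> List[str]:
    """Turn a particle belief into ASP-style ground atoms.

    The predicates guess/2 and dist/2 are assumptions made for illustration.
    """
    n = len(particles)
    # Assumption: the agent position is the same in every particle.
    (ax, ay), _ = particles[0]
    atoms = []
    for rock, (rx, ry) in sorted(rock_positions.items()):
        # Number of particles in which this rock is valuable.
        count = sum(1 for _, values in particles if values[rock])
        # Belief that the rock is valuable, floored to a multiple of `step` percent.
        percent = (100 * count // n) // step * step
        atoms.append(f"guess({rock},{percent})")
        # Manhattan distance between the agent and the rock.
        atoms.append(f"dist({rock},{abs(ax - rx) + abs(ay - ry)})")
    return atoms


def trace_step_to_example(atoms: List[str], action: str) -> str:
    """Pair the logical belief description with the action taken at that step."""
    return f"context: {' '.join(a + '.' for a in atoms)} | action: {action}"
```

Each step of an execution trace would then yield one such belief-action pair, which an ILP system can consume as a training example. The string format returned by `trace_step_to_example` is a placeholder and does not follow ILASP's actual example syntax.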

Paper Structure (42 sections, 38 equations, 12 figures, 4 tables)


Figures (12)

  • Figure 1: Example scenarios for our two case studies.
  • Figure 2: Simplified rocksample scenario with $N=M=2$.
  • Figure 3: Discounted return (top) and computational time per step (bottom) with POMCP (mean $\pm$ standard deviation) for rocksample and pocman with different numbers of particles in the training setting (EXP-1).
  • Figure 4: POMCP performance (mean $\pm$ standard deviation) for rocksample with different numbers of rocks and grid sizes (EXP-2).
  • Figure 5: Discounted return (top) and computational time per step (bottom) with DESPOT C++ (mean $\pm$ standard deviation) for rocksample in larger grids and with more rocks (EXP-2).
  • ...and 7 more figures

Theorems & Definitions (3)

  • Definition 1: Partial interpretation
  • Definition 2: Context-dependent partial interpretation (CDPI)
  • Definition 3: ILASP task with CDPIs