A Model-Free Universal AI

Yegon Kim; Juho Lee

TL;DR

This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL.

Abstract

In general reinforcement learning, all established optimal agents, including AIXI, are model-based: they explicitly maintain and use environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, rather than over policies or environments as in prior work. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. Our results significantly expand the diversity of known universal agents.
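To make the idea of induction over distributional action-value functions concrete, here is a minimal sketch and not the paper's algorithm. It assumes a small finite class of hypothetical return-predictors, each mapping an action to a categorical distribution over discretized returns, and updates a Bayesian mixture over that class from observed returns. The names `RETURN_BINS`, `CANDIDATES`, `bayes_update`, `mixture_expected_return`, and `run`, the two actions, and the fixed exploration rate are all illustrative assumptions. The true predictor is placed inside the class, loosely mirroring a grain of truth condition.

```python
import random

# Hypothetical discretized return levels; an assumption, not from the paper.
RETURN_BINS = [0.0, 0.5, 1.0]
ACTIONS = ["left", "right"]

# Each candidate return-predictor maps an action to a categorical
# distribution over RETURN_BINS. Candidate 0 is used as the truth below,
# so the class contains the true predictor.
CANDIDATES = [
    {"left": [0.7, 0.2, 0.1], "right": [0.1, 0.2, 0.7]},
    {"left": [0.1, 0.2, 0.7], "right": [0.7, 0.2, 0.1]},
    {"left": [1 / 3, 1 / 3, 1 / 3], "right": [1 / 3, 1 / 3, 1 / 3]},
]


def bayes_update(weights, action, bin_index):
    """Posterior over candidates after observing a return bin for an action."""
    unnormalized = [w * c[action][bin_index] for w, c in zip(weights, CANDIDATES)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mixture_expected_return(weights, action):
    """Expected return of an action under the posterior-weighted mixture."""
    return sum(
        w * sum(p * g for p, g in zip(c[action], RETURN_BINS))
        for w, c in zip(weights, CANDIDATES)
    )


def run(steps=500, seed=0, explore=0.1):
    """Act mostly greedily on the mixture, explore occasionally, update weights."""
    rng = random.Random(seed)
    weights = [1 / len(CANDIDATES)] * len(CANDIDATES)
    truth = CANDIDATES[0]
    for _ in range(steps):
        if rng.random() < explore:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: mixture_expected_return(weights, a))
        # Sample the observed return bin from the true predictor.
        bin_index = rng.choices(range(len(RETURN_BINS)), weights=truth[action])[0]
        weights = bayes_update(weights, action, bin_index)
    return weights


if __name__ == "__main__":
    final = run()
    print("posterior weights:", final)
    print("mixture value of 'right':", mixture_expected_return(final, "right"))
```

In this toy setting the posterior concentrates on the true candidate, and the greedy action under the mixture becomes the truly better one. No environment model is learned at any point: the hypotheses predict returns directly, which is the sense of "model-free" illustrated here.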

Paper Structure (31 sections, 16 theorems, 92 equations, 1 figure, 3 tables, 1 algorithm)

Introduction
Background
General Reinforcement Learning
AIXI
Universal AI with Q-Induction
Results
Convergence of Return-Predictor
One-Step Optimality
Asymptotic Optimality
Off-Policy Behavior
Related Work
Discussion
Proof Technique
Partial Observability
Continual Reinforcement Learning
...and 16 more sections

Key Result

Lemma 1

Let $\pi$ be the AIQI policy $\hat{\pi}^{H,M,N,\tau}_\psi$ where $\psi$ has a grain of truth w.r.t. an environment $\nu$. There exists a $\nu^\pi$-probability-one set $S \subseteq \Omega$ such that for all $h\in S$ and $n\in [0,N-1]$,

Figures (1)

Figure 1: Exponential moving average (EMA) of reward versus wall-clock time (in seconds) on three environments

Theorems & Definitions (34)

Definition 1: Effective horizon
Definition 2: Mixture environment
Definition 3: AIXI
Definition 4: Mixture return-predictor
Definition 5: Unified return-predictor
Definition 6: AIQI
Definition 7: Grain of truth
Definition 8: Strong Asymptotic $\varepsilon$-Optimality
Lemma 1: Convergence in TV distance
Lemma 2: Convergence of return-predictor
...and 24 more
