A Model-Free Universal AI

Yegon Kim, Juho Lee

TL;DR

This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL.

Abstract

In general reinforcement learning, all established optimal agents, including AIXI, are model-based: they explicitly maintain and use environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, rather than over policies or environments as in prior work. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. Our results significantly expand the diversity of known universal agents.
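As a rough illustration of induction over return distributions, the sketch below runs a Bayesian mixture over a small finite class of candidate return-predictors and checks that the mixture approaches the true predictor in total variation distance. It is a toy under strong assumptions and not the paper's construction: the three candidate distributions, the four return bins, the uniform prior, the i.i.d. sampling of returns, and all function names are hypothetical. The "grain of truth" is modeled here simply as the true predictor being in the class with positive prior weight.

```python
import random


def posterior_update(weights, likelihoods):
    """One Bayes step: reweight each hypothesis by the probability it assigned to the observation."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mixture_prediction(weights, predictors):
    """Posterior-weighted mixture over return bins: sum_i w_i * p_i."""
    n_bins = len(predictors[0])
    return [sum(w * p[k] for w, p in zip(weights, predictors)) for k in range(n_bins)]


def tv_distance(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical hypothesis class: three candidate distributions over 4 discretized return bins.
predictors = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.20, 0.30, 0.40],  # assumed to be the true return distribution
    [0.25, 0.25, 0.25, 0.25],
]
true_index = 1

# Uniform prior; the true predictor has positive weight (toy "grain of truth").
weights = [1 / 3, 1 / 3, 1 / 3]

rng = random.Random(0)
for _ in range(500):
    # Sample an observed return bin from the true distribution (i.i.d. for simplicity).
    observed_bin = rng.choices(range(4), weights=predictors[true_index])[0]
    weights = posterior_update(weights, [p[observed_bin] for p in predictors])

final_tv = tv_distance(mixture_prediction(weights, predictors), predictors[true_index])
```

In this toy setting the posterior weight concentrates on the true predictor, so `final_tv` shrinks toward zero. The general RL setting treated in the paper is history-dependent and non-i.i.d., which is where the actual convergence lemmas are needed.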

Paper Structure

This paper contains 31 sections, 16 theorems, 92 equations, 1 figure, 3 tables, 1 algorithm.

Key Result

Lemma 1

Let $\pi$ be the AIQI policy $\hat{\pi}^{H,M,N,\tau}_\psi$ where $\psi$ has a grain of truth w.r.t. an environment $\nu$. There exists a $\nu^\pi$-probability-one set $S \subseteq \Omega$ such that for all $h\in S$ and $n\in [0,N-1]$,

Figures (1)

  • Figure 1: Plot of exponential moving average (EMA) reward vs. wall-clock time (in seconds) on three environments

Theorems & Definitions (34)

  • Definition 1: Effective horizon
  • Definition 2: Mixture environment
  • Definition 3: AIXI
  • Definition 4: Mixture return-predictor
  • Definition 5: Unified return-predictor
  • Definition 6: AIQI
  • Definition 7: Grain of truth
  • Definition 8: Strong Asymptotic $\varepsilon$-Optimality
  • Lemma 1: Convergence in TV distance
  • Lemma 2: Convergence of return-predictor
  • ...and 24 more