BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning

Junsung Park

TL;DR

BiCQL-ML tackles offline IRL by bypassing explicit policy learning and instead performing a bi-level optimization that alternates conservative Q-function estimation with maximum-likelihood-inspired reward learning. The lower level uses Conservative Q-Learning to produce robust, conservative value estimates, while the upper level updates the reward to align with expert behavior via a surrogate soft-advantage target. The authors prove contraction and fixed-point guarantees, and empirically demonstrate improved reward recovery and downstream policy performance on MuJoCo/D4RL benchmarks, outperforming BC, DAC, and ValueDICE in several data regimes. This approach offers a stable, scalable alternative to adversarial IRL methods for offline settings with limited or no online data collection.

Abstract

Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bi-level framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.
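The alternation described in the abstract can be illustrated on a toy tabular problem. The sketch below is not the authors' implementation: the random MDP, the step sizes, the tabular reward table, and the helper names (`conservative_q`, `reward_step`, `run_sketch`) are all assumptions made for illustration. The lower level fits a soft Q-table with a CQL-style penalty (log-sum-exp of Q minus Q on dataset actions); the upper level takes a surrogate likelihood-gradient step on the reward that treats Q as moving one-for-one with the reward, which is a simplification.

```python
import numpy as np

def logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp along `axis`."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def softmax(x, axis=-1):
    """Boltzmann policy pi(a|s) proportional to exp(Q(s,a)); temperature 1 is assumed."""
    z = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return z / np.sum(z, axis=axis, keepdims=True)

def conservative_q(reward, P, data_freq, gamma=0.9, alpha=0.1, iters=500, lr=0.5):
    """Lower level (sketch): soft Bellman regression plus a CQL-style penalty."""
    Q = np.zeros_like(reward)
    for _ in range(iters):
        V = logsumexp(Q, axis=1)                  # soft state value, shape (S,)
        target = reward + gamma * (P @ V)         # P has shape (S, A, S)
        bellman_grad = Q - target                 # semi-gradient of 0.5 * (Q - target)^2
        # Gradient of alpha * (logsumexp_a Q(s,a) - E_{a ~ data} Q(s,a)) w.r.t. Q.
        cql_grad = alpha * (softmax(Q, axis=1) - data_freq)
        Q = Q - lr * (bellman_grad + cql_grad)
    return Q

def reward_step(reward, Q, expert_sa, lr=0.5):
    """Upper level (sketch): ascend the expert log-likelihood under softmax(Q).

    Assumption: dQ/dr is approximated by the identity, so the gradient is
    (indicator of the expert action) - pi(.|s) at each expert state.
    """
    pi = softmax(Q, axis=1)
    grad = np.zeros_like(reward)
    for s, a in expert_sa:
        grad[s] -= pi[s]
        grad[s, a] += 1.0
    return reward + lr * grad / len(expert_sa)

def expert_log_likelihood(Q, expert_sa):
    """Mean log pi(a|s) over expert pairs, with pi = softmax(Q)."""
    log_pi = Q - logsumexp(Q, axis=1)[:, None]
    return float(np.mean([log_pi[s, a] for s, a in expert_sa]))

def run_sketch(n_s=3, n_a=2, outer_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # random transitions (S, A, S)
    expert_sa = [(s, s % n_a) for s in range(n_s)]     # hypothetical expert dataset
    data_freq = np.zeros((n_s, n_a))
    for s, a in expert_sa:
        data_freq[s, a] += 1.0
    data_freq /= data_freq.sum(axis=1, keepdims=True)  # empirical action frequencies
    reward = np.zeros((n_s, n_a))
    for _ in range(outer_iters):                       # alternate lower and upper level
        Q = conservative_q(reward, P, data_freq)
        reward = reward_step(reward, Q, expert_sa)
    Q = conservative_q(reward, P, data_freq)           # Q under the final reward
    return reward, Q, expert_sa

reward, Q, expert_sa = run_sketch()
print("mean expert log-likelihood:", expert_log_likelihood(Q, expert_sa))
```

No policy network appears anywhere in the loop: the policy exists only implicitly as `softmax(Q)`, which is the sense in which the method is described as policy-free.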

Paper Structure

This paper contains 24 sections, 4 theorems, 34 equations, 3 figures, 1 algorithm.

Key Result

Lemma 1

Define the composite update $G: \Theta \to \Theta$ by $G(\theta) = F_{\text{upper}}(F_{\text{lower}}(\theta))$, where $F_{\text{lower}}(\theta)$ returns the optimal $Q$-function for reward $r_\theta$, and $F_{\text{upper}}(Q)$ returns the maximizing reward parameters given $Q$. Under the Lipschitz assumption, $G$ is a contraction mapping on the reward parameter space $\Theta$. In particular, for any $\theta, \theta' \in \Theta$, $\|G(\theta) - G(\theta')\| \le L_{\text{ML}} L_Q \|\theta - \theta'\|$, and since $L_{\text{ML}} L_Q < 1$ by assumption, $G$ is a contraction.
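The contraction claim rests on the standard argument for composing Lipschitz maps. A sketch of the intermediate steps, assuming (the statement here does not spell this out) that $L_Q$ is the Lipschitz constant of $F_{\text{lower}}$, that $L_{\text{ML}}$ is the Lipschitz constant of $F_{\text{upper}}$, and that $G$ is the composition of the two:

```latex
\begin{align*}
\|G(\theta) - G(\theta')\|
  &= \|F_{\text{upper}}(F_{\text{lower}}(\theta)) - F_{\text{upper}}(F_{\text{lower}}(\theta'))\| \\
  &\le L_{\text{ML}} \, \|F_{\text{lower}}(\theta) - F_{\text{lower}}(\theta')\| \\
  &\le L_{\text{ML}} L_Q \, \|\theta - \theta'\| .
\end{align*}
```

Writing $\kappa = L_{\text{ML}} L_Q < 1$ and provided $\Theta$ is complete (for example, a closed subset of $\mathbb{R}^d$), the Banach fixed-point theorem then gives a unique fixed point $\theta^*$ and geometric convergence of the iteration $\theta_{k+1} = G(\theta_k)$, namely $\|\theta_k - \theta^*\| \le \kappa^k \|\theta_0 - \theta^*\|$.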

Figures (3)

  • Figure 1: Bi-level offline IRL algorithm: the lower level uses Conservative Q-Learning (CQL) to learn a conservative Q-function $Q^*_\phi(s,a)$, while the upper level updates the reward $R^*_\theta(s,a)$ to maximize expert likelihood under the induced Boltzmann policy. Both components are trained iteratively using only offline data.
  • Figure 2: Performance comparison in the low-data regime (a single expert demonstration). BiCQL-ML significantly outperforms the other baselines across the Ant, HalfCheetah, Hopper, and Walker tasks in both sample efficiency and final average return.
  • Figure 3: Performance comparison in the high-data regime (10 expert demonstrations). BiCQL-ML achieves consistently high returns and performs competitively with or better than ValueDICE and DAC across all tasks.

Theorems & Definitions (4)

  • Lemma 1: Contraction of Composite Update
  • Theorem 1: Convergence to Fixed Point
  • Lemma 2: Likelihood Optimality Condition
  • Theorem 2: Expert Optimality under Learned Reward