BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning

Junsung Park

TL;DR

BiCQL-ML tackles offline IRL without explicit policy learning. It performs a bi-level optimization that alternates conservative Q-function estimation with maximum-likelihood-inspired reward learning. The lower level uses Conservative Q-Learning (CQL) to produce conservative value estimates under the current reward, while the upper level updates the reward to align with expert behavior via a surrogate soft-advantage target. The authors prove contraction and fixed-point guarantees and report improved reward recovery and downstream policy performance on MuJoCo/D4RL benchmarks, outperforming BC, DAC, and ValueDICE in several data regimes. The approach offers a stable, scalable alternative to adversarial IRL methods in offline settings with limited or no online data collection.

Abstract

Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bi-level framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.
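The two alternating objectives can be illustrated with a minimal tabular sketch. It is not the authors' implementation. The function names, the table layout `q[s][a]`, the soft (log-sum-exp) Bellman backup, and the default `gamma` and `alpha` values are all assumptions made for illustration. The lower-level loss combines a Bellman error with the standard CQL penalty, and the upper-level objective scores expert actions by their soft advantage Q(s, a) - log-sum-exp over a' of Q(s, a').

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_loss(q, reward, batch, gamma=0.99, alpha=1.0):
    """Lower-level objective (sketch): Bellman error plus a CQL penalty.

    q[s][a] and reward[s][a] are tables indexed by discrete state and action.
    batch is a list of (s, a, s_next) transitions from the offline dataset.
    """
    total = 0.0
    for s, a, s_next in batch:
        # Assumption: a soft backup, V(s') = logsumexp_a' Q(s', a').
        target = reward[s][a] + gamma * logsumexp(q[s_next])
        bellman = 0.5 * (q[s][a] - target) ** 2
        # CQL penalty: push down Q over all actions, push up Q on data actions.
        conservative = logsumexp(q[s]) - q[s][a]
        total += bellman + alpha * conservative
    return total / len(batch)


def expert_log_likelihood(q, expert_pairs):
    """Upper-level objective (sketch): mean soft advantage of expert actions.

    Equals the mean log-probability of the expert actions under softmax(Q).
    """
    total = sum(q[s][a] - logsumexp(q[s]) for s, a in expert_pairs)
    return total / len(expert_pairs)


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 0.0]]
    reward = [[0.0, 0.0], [0.0, 0.0]]
    print(cql_loss(q, reward, [(0, 0, 1)]))
    print(expert_log_likelihood(q, [(0, 0)]))
```

Under these assumptions, one outer iteration would minimize `cql_loss` with respect to the Q-table while holding the reward fixed, then adjust the reward parameters to increase `expert_log_likelihood` of the resulting Q-function. Function approximators would replace the tables in a continuous-control setting such as MuJoCo.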
