Functional Critic Modeling for Provably Convergent Off-Policy Actor-Critic
Qinxun Bai, Yuxuan Han, Wei Xu, Zhengyuan Zhou
TL;DR
The paper tackles the instability of off-policy actor-critic (AC) learning under function approximation by introducing functional critic modeling, in which the critic maps policies to value estimates and can therefore generalize across changing policies. It presents a meta-algorithm that decouples policy evaluation from policy improvement, so that exact off-policy gradients can be computed from the learned functional critic without emphatic corrections. The authors provide a theoretical convergence analysis in the linear function approximation setting, establishing the first convergent off-policy target-based AC algorithm under function approximation, and offer a minimal neural-network implementation with preliminary results on DeepMind Control benchmarks. Practically, the approach uses an ensemble of functional critics with target networks, transformer-based actor encoders, and deterministic actors to achieve stable learning without slow two-timescale updates, showing promising data efficiency and performance relative to state-of-the-art baselines. Overall, functional critic modeling can address core challenges of off-policy AC and improve sample efficiency while preserving convergence guarantees and scalability to neural architectures.
Abstract
Off-policy reinforcement learning (RL) with function approximation offers an effective way to improve sample efficiency by reusing past experience. Within this setting, the actor-critic (AC) framework has achieved strong empirical success. However, both critic learning and actor learning are challenging for off-policy AC methods. First, in addition to the classic "deadly triad" instability of off-policy evaluation, critic learning suffers from a "moving target" problem, where the policy being evaluated changes continually. Second, actor learning becomes less efficient because the exact off-policy policy gradient is difficult to estimate. The first challenge essentially reduces the problem to repeatedly performing off-policy evaluation for changing policies. For the second challenge, the off-policy policy gradient theorem requires a complex and often impractical algorithm to estimate an additional emphasis critic; in practice this term is typically neglected, which reduces the update to the on-policy policy gradient as an approximation. In this work, we introduce the novel concept of functional critic modeling, which leads to a new AC framework that addresses both challenges for actor-critic learning under the deadly triad setting. We provide a theoretical analysis in the linear function approximation setting, establishing the provable convergence of our framework, which, to the best of our knowledge, is the first convergent off-policy target-based AC algorithm. From a practical perspective, we further propose a carefully designed neural network architecture for functional critic modeling and demonstrate its effectiveness through preliminary experiments on widely used RL tasks from the DeepMind Control Benchmark.
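
To make the idea of a critic that takes the policy as an input more concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes a scalar state and action, a deterministic linear actor indexed by a single parameter `theta`, and a hand-picked feature map; the names `features`, `q_value`, `policy`, and `td_update` are hypothetical. The critic is linear in joint features of (state, action, policy parameter), and each update bootstraps from frozen target weights, so one off-policy transition can be reused to evaluate many policies.

```python
import random


def features(state, action, theta):
    """Hypothetical joint features of (state, action, policy parameter)."""
    return [1.0, state, action, theta, state * action, action * theta]


def q_value(weights, state, action, theta):
    """Linear functional critic: Q(s, a; pi_theta) = w . phi(s, a, theta)."""
    return sum(w * f for w, f in zip(weights, features(state, action, theta)))


def policy(theta, state):
    """Deterministic linear actor a = theta * s (an illustrative assumption)."""
    return theta * state


def td_update(weights, target_weights, transition, theta, gamma=0.9, lr=0.05):
    """One semi-gradient TD step for the policy indexed by theta.

    The bootstrap target uses the frozen target_weights and the action
    that pi_theta would take in the next state, so the same off-policy
    transition can be reused for any theta.
    """
    s, a, r, s_next = transition
    next_action = policy(theta, s_next)
    target = r + gamma * q_value(target_weights, s_next, next_action, theta)
    td_error = target - q_value(weights, s, a, theta)
    phi = features(s, a, theta)
    return [w + lr * td_error * f for w, f in zip(weights, phi)]


rng = random.Random(0)
weights = [0.0] * 6
target_weights = list(weights)  # kept frozen here for simplicity

# A single transition (s, a, r, s') collected by some behavior policy.
transition = (0.5, -0.2, 1.0, 0.3)

# Evaluate many different policies from the same stored experience.
for _ in range(100):
    theta = rng.uniform(-1.0, 1.0)
    weights = td_update(weights, target_weights, transition, theta)
```

Because `theta` is an argument of the critic rather than something baked into its weights, changing the actor does not by itself invalidate the critic; whether and how well it generalizes to unseen policies depends on the feature map, which is chosen arbitrarily here.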
