Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization

Teodor V. Marinov; Alekh Agarwal; Mircea Trofin

TL;DR

The paper tackles offline imitation learning with $K$ baseline policies in a contextual finite-horizon MDP, where only trajectory-level rewards are observed. It introduces BC-Max, a simple yet effective algorithm that, for each context, imitates the actions from the highest-reward baseline trajectory and provides a regret bound $\mathrm{Reg}(\hat{\pi}) \leq O\left(\epsilon H + \frac{H^2 \log(H|\Pi|/\delta)}{n}\right)$ under a realizability assumption, with a matching lower bound establishing minimax optimality. The authors demonstrate practical value through a compiler-optimization case study, showing that iterative BC-Max can improve over an initial online RL baseline in reducing binary size, using both proprietary and Chrome-on-Android datasets. The work offers a scalable offline learning recipe with strong theoretical guarantees and real-world applicability, enabling policy improvement from multiple baselines without online interaction.
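The selection step described above (keep, for each context, the baseline trajectory with the highest reward, then imitate its actions) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the data layout, the function names `select_best_trajectories` and `fit_tabular_policy`, and the majority-vote tabular learner are all assumptions made for illustration. The paper instead fits a policy from a class $\Pi$; the tabular learner only stands in for that fitting step.

```python
from collections import Counter, defaultdict


def select_best_trajectories(dataset):
    """Keep, for each context, the trajectory with the highest total reward.

    `dataset` (hypothetical layout) maps a context id to a list of
    (trajectory, reward) pairs, one per baseline policy. A trajectory is a
    list of (state, action) pairs; only the trajectory-level reward is used.
    """
    best = {}
    for context, rollouts in dataset.items():
        trajectory, _reward = max(rollouts, key=lambda item: item[1])
        best[context] = trajectory
    return best


def fit_tabular_policy(best_trajectories):
    """Behavior-clone the selected trajectories with a majority vote per state.

    Stand-in for fitting a policy from a richer class: it simply returns
    the most frequent action observed in each state.
    """
    counts = defaultdict(Counter)
    for trajectory in best_trajectories.values():
        for state, action in trajectory:
            counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Toy data: two contexts, two baselines each (values are made up).
toy_dataset = {
    "ctx_a": [
        ([("s0", "inline"), ("s1", "skip")], 1.0),
        ([("s0", "skip"), ("s1", "skip")], 3.0),
    ],
    "ctx_b": [
        ([("s2", "inline")], 5.0),
        ([("s2", "skip")], 2.0),
    ],
}

best = select_best_trajectories(toy_dataset)
policy = fit_tabular_policy(best)
```

In this toy data the second baseline wins on `ctx_a` and the first on `ctx_b`, so the cloned policy mixes decisions from both baselines, which is the "best combination of baselines" behavior the summary describes.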

Abstract

This work studies a Reinforcement Learning (RL) problem in which we are given a set of trajectories collected with $K$ baseline policies. Each of these policies can be quite suboptimal in isolation, yet have strong performance in complementary parts of the state space. The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space. We propose a simple imitation-learning-based algorithm, show a sample complexity bound on its accuracy, and prove that the algorithm is minimax optimal by showing a matching lower bound. Further, we apply the algorithm in the setting of machine-learning-guided compiler optimization to learn policies for inlining programs with the objective of creating a small binary. We demonstrate that we can learn a policy that outperforms an initial policy learned via standard RL through a few iterations of our approach.

Paper Structure (26 sections, 2 theorems, 18 equations, 4 figures, 3 algorithms)

Introduction
Setting and Related Work
Problem setting
Contextual MDP setting.
Goal.
Learning setup.
Related work
Vanilla behavior cloning
Value-based improvement upon multiple baselines (MAMBA)
Offline RL:
Algorithm and Regret Bound
Algorithm.
Performance guarantee for BC-Max.
Implementation details
Lower bounds
...and 11 more sections

Key Result

Theorem 3.2

Under the realizability assumption, after collecting $n$ trajectories from each of the $K$ base policies, the BC-Max algorithm returns a policy $\hat{\pi}$ with regret at most $O\left(\epsilon H + \frac{H^2 \log(H|\Pi|/\delta)}{n}\right)$ with probability at least $1-\delta$.
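To get a feel for how the bound stated in the TL;DR, $O\left(\epsilon H + \frac{H^2 \log(H|\Pi|/\delta)}{n}\right)$, scales with its parameters, the following Python sketch evaluates the expression inside the $O(\cdot)$. The function name `regret_bound` and the constant `c` are hypothetical: the summary does not give the constant hidden by the $O(\cdot)$, so `c = 1.0` is a placeholder and the returned numbers are only meaningful relative to each other.

```python
import math


def regret_bound(epsilon, horizon, policy_class_size, n, delta, c=1.0):
    """Evaluate c * (eps * H + H^2 * log(H * |Pi| / delta) / n).

    `c` is a placeholder for the unspecified constant in the O(.) bound.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of trajectories")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    statistical_term = (
        horizon ** 2 * math.log(horizon * policy_class_size / delta) / n
    )
    return c * (epsilon * horizon + statistical_term)


# With epsilon = 0, doubling n halves the bound (illustrative values only).
small_n = regret_bound(epsilon=0.0, horizon=10, policy_class_size=1000, n=500, delta=0.05)
large_n = regret_bound(epsilon=0.0, horizon=10, policy_class_size=1000, n=1000, delta=0.05)
```

The first term, $\epsilon H$, does not shrink as $n$ grows, so beyond a certain number of trajectories the bound is dominated by $\epsilon$ rather than by sample size.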

Figures (4)

Figure 1: Savings in MB from ES on training binary
Figure 2: Savings in MB from ES on test binary
Figure 3: Savings in MB from PPO on sum of module sizes
Figure 4: Savings in MB from PPO on binary size

Theorems & Definitions (3)

Theorem 3.2
proof
Theorem 4.1
