LICORICE: Label-Efficient Concept-Based Interpretable Reinforcement Learning

Zhuorui Ye, Stephanie Milani, Geoffrey J. Gordon, Fei Fang

TL;DR

LICORICE tackles the annotation bottleneck in concept-based RL with a training framework that interleaves concept learning with RL training, decorrelates the concept training data to increase sample diversity, and uses disagreement-based active learning to reduce the number of labels required. Across five image-based tasks, it achieves performance comparable to an unlimited-label baseline with 500 or fewer concept labels in three environments and 5000 or fewer in two more complex ones. The paper also explores automated labeling with vision-language models (with mixed success), demonstrates test-time concept interventions, analyzes data diversity and hyperparameter robustness, and discusses societal and practical implications. Overall, LICORICE makes interpretable RL more practical by delivering high-performing policies under tight concept-label budgets and by providing a blueprint for active, on-policy concept learning.

Abstract

Recent advances in reinforcement learning (RL) have predominantly leveraged neural network policies for decision-making, yet these models often lack interpretability, posing challenges for stakeholder comprehension and trust. Concept bottleneck models offer an interpretable alternative by integrating human-understandable concepts into policies. However, prior work assumes that concept annotations are readily available during training. For RL, this requirement poses a significant limitation: it necessitates continuous real-time concept annotation, which either places an impractical burden on human annotators or incurs substantial costs in API queries and inference time when employing automated labeling methods. To overcome this limitation, we introduce a novel training scheme that enables RL agents to efficiently learn a concept-based policy by only querying annotators to label a small set of data. Our algorithm, LICORICE, involves three main contributions: interleaving concept learning and RL training, using an ensemble to actively select informative data points for labeling, and decorrelating the concept data. We show how LICORICE reduces human labeling efforts to 500 or fewer concept labels in three environments, and 5000 or fewer in two more complex environments, all at no cost to performance. We also explore the use of VLMs as automated concept annotators, finding them effective in some cases but imperfect in others. Our work significantly reduces the annotation burden for interpretable RL, making it more practical for real-world applications that necessitate transparency.
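
The abstract's second contribution, using an ensemble to pick informative points for labeling, can be illustrated with a minimal sketch. The abstract does not state the acquisition function, so the vote-based disagreement score, the function names `vote_disagreement` and `select_for_labeling`, and the toy votes below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence


def vote_disagreement(votes: Sequence[int]) -> float:
    """Fraction of ensemble members that differ from the majority vote (assumed score)."""
    majority_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - majority_count / len(votes)


def select_for_labeling(ensemble_votes: List[Sequence[int]], budget: int) -> List[int]:
    """Return indices of the `budget` candidates with the highest disagreement."""
    scores = [vote_disagreement(v) for v in ensemble_votes]
    # sorted() is stable, so ties keep their original candidate order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical concept predictions from a 4-member ensemble on 4 unlabeled states.
votes = [
    [1, 1, 1, 1],  # full agreement
    [0, 1, 2, 1],  # two members deviate from the majority
    [2, 2, 0, 2],  # one member deviates
    [0, 1, 0, 1],  # evenly split
]
chosen = select_for_labeling(votes, budget=2)
print(chosen)
```

Only the selected states would be sent to the annotator, which is how a fixed label budget can be spent on the points the ensemble is least sure about.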

Paper Structure (27 sections, 6 equations, 10 figures, 11 tables, 1 algorithm)

Figures (10)

  • Figure 1: LICORICE overview. In concept-based RL, the policy first maps from states to the concepts in a bottleneck layer with $g$, and then maps from concepts to (distributions over) actions with $f$. During training, LICORICE addresses concept label efficiency concerns with three key components: i) iterative training, ii) data decorrelation, and iii) active learning.
  • Figure 2: LICORICE generally achieves higher reward and lower concept error than all other budget-constrained algorithms (Sequential-Q, Disagreement-Q, Random-Q). The budgets for each environment are PixelCartpole: $500$, DoorKey: $300$, DynamicObstacles: $300$, Boxing: $3000$, Pong: $5000$. CPM has an unlimited budget (in practice, it uses 4M, 4M, 1M, 15M, 10M labels, respectively). Despite this extreme budget difference, LICORICE achieves comparable reward and concept error. Full results in Table \ref{tab:appx:baselines}, Appendix \ref{appx:addtl_results}.
  • Figure 3: Reward (top) and concept error (bottom) of all algorithms for different budgets. LICORICE makes more efficient use of each budget, achieving higher reward and lower concept error. Full results with standard deviations are in Tables \ref{tab:exp2_appx}, \ref{tab:gpt4o_appx}, and \ref{tab:baseline_budget_appx}, Appendix \ref{appx:addtl_results}.
  • Figure 4: Concept intervention results: LICORICE enables test-time concept intervention.
  • Figure 5: Test-time intervention examples, where intervening on a single concept corrects the action.
  • ...and 5 more figures
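
The captions for Figures 1, 4, and 5 describe a policy that maps states to concepts with $g$, maps concepts to action distributions with $f$, and allows a concept to be overridden at test time. A minimal sketch of that composition follows. The thresholding rule in `g`, the toy weights in `f`, and the `interventions` argument are hypothetical stand-ins chosen for illustration; the paper's networks are learned and are not reproduced here.

```python
import math
from typing import Dict, List, Optional


def g(state: List[float]) -> List[int]:
    """Hypothetical concept predictor: thresholds each feature into a binary concept."""
    return [1 if x > 0.0 else 0 for x in state]


def f(concepts: List[int]) -> List[float]:
    """Hypothetical action head: softmax over linear scores of the concepts."""
    weights = [[2.0, -1.0], [-2.0, 1.0]]  # one row per action (toy values)
    logits = [sum(w * c for w, c in zip(row, concepts)) for row in weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(state: List[float],
           interventions: Optional[Dict[int, int]] = None) -> List[float]:
    """Compute f(g(state)), optionally overriding predicted concepts by index."""
    concepts = g(state)
    for idx, value in (interventions or {}).items():
        concepts[idx] = value
    return f(concepts)


def best_action(probs: List[float]) -> int:
    return max(range(len(probs)), key=probs.__getitem__)


state = [-0.1, 0.5]                          # g predicts concepts [0, 1]
plain = policy(state)                        # action distribution without intervention
fixed = policy(state, interventions={0: 1})  # override concept 0 to 1
print(best_action(plain), best_action(fixed))
```

Because the action head only sees the concept vector, overriding a single concept is enough to change the chosen action in this toy setup, which mirrors the kind of single-concept correction the Figure 5 caption refers to.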