Sample-Efficient Behavior Cloning Using General Domain Knowledge

Feiyu Zhu, Jean Oh, Reid Simmons

TL;DR

The paper addresses the sample inefficiency of behavior cloning (BC) by embedding expert domain knowledge into the structure of the policy itself. It introduces the Knowledge Informed Model (KIM): an LLM generates a task-specific, semantically meaningful policy structure from knowledge expressed in natural language, and BC then tunes the parameters of that structure on demonstrations. On Lunar Lander and Car Racing, KIM solves the tasks with as few as 5 demonstrations and is robust to action noise, outperforming unstructured baselines that have far more parameters. The work shows that using domain knowledge to shape model structure can markedly improve data efficiency and robustness in sequential decision-making tasks.

Abstract

Behavior cloning has shown success in many sequential decision-making tasks by learning from expert demonstrations, yet it can be very sample inefficient and can fail to generalize to unseen scenarios. One approach to these problems is to introduce general domain knowledge, so that the policy can focus on the essential features and may generalize to unseen states by applying that knowledge. Although this knowledge is easy to acquire from experts, it is hard to combine with learning from individual examples, due to the lack of semantic structure in neural networks and the time-consuming nature of feature engineering. To enable learning from both general knowledge and specific demonstration trajectories, we use a large language model's coding capability to instantiate a policy structure based on expert domain knowledge expressed in natural language, and we tune the parameters of that policy with demonstrations. We name this approach the Knowledge Informed Model (KIM), as the structure reflects the semantics of expert knowledge. In our experiments with lunar lander and car racing tasks, our approach learns to solve the tasks with as few as 5 demonstrations and is robust to action noise, outperforming the baseline model without domain knowledge. This indicates that, with the help of large language models, we can incorporate domain knowledge into the structure of the policy, increasing sample efficiency for behavior cloning.
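
To make the idea of tuning a knowledge-shaped policy with behavior cloning concrete, here is a minimal sketch in Python. It is not the authors' model or code. The toy one-dimensional landing task, the rule "fire the engine when descending faster than an altitude-dependent allowed speed", the parameter names `a`, `b`, and `gain`, and the hand-built demonstration grid are all assumptions made for illustration. The structure is fixed by the rule, and only three parameters are fitted to the demonstrated actions by gradient descent on a cross-entropy loss.

```python
import math

def fire_probability(params, height, descent_speed):
    """Hypothetical structured policy with three tunable parameters."""
    a, b, gain = params
    allowed_speed = a * height + b                  # latent variable, linear in height
    logit = gain * (descent_speed - allowed_speed)  # how far the speed exceeds the limit
    return 1.0 / (1.0 + math.exp(-logit))           # non-linear operation (sigmoid)

def bc_loss(params, demos):
    """Mean cross-entropy between policy outputs and demonstrated actions."""
    eps = 1e-12
    total = 0.0
    for height, speed, action in demos:
        p = fire_probability(params, height, speed)
        total -= action * math.log(p + eps) + (1 - action) * math.log(1 - p + eps)
    return total / len(demos)

def train(params, demos, lr=0.05, steps=2000):
    """Full-batch gradient descent on the behavior cloning loss."""
    a, b, gain = params
    n = len(demos)
    for _ in range(steps):
        grad_a = grad_b = grad_gain = 0.0
        for height, speed, action in demos:
            err = fire_probability((a, b, gain), height, speed) - action  # dLoss/dlogit
            grad_a += err * (-gain * height)
            grad_b += err * (-gain)
            grad_gain += err * (speed - (a * height + b))
        a -= lr * grad_a / n
        b -= lr * grad_b / n
        gain -= lr * grad_gain / n
    return (a, b, gain)

# Toy "expert" (an assumption): fires when descent speed exceeds 0.5 * height + 0.2.
demos = [
    (h, s, 1 if s > 0.5 * h + 0.2 else 0)
    for h in (0.0, 0.5, 1.0, 1.5, 2.0)
    for s in (0.1, 0.5, 0.9, 1.3, 1.7)
]

initial_params = (0.0, 0.0, 1.0)
initial_loss = bc_loss(initial_params, demos)
trained_params = train(initial_params, demos)
final_loss = bc_loss(trained_params, demos)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print("learned (a, b, gain):", tuple(round(v, 2) for v in trained_params))
```

Because the rule fixes which quantities interact, the fitted values stay interpretable: `a` and `b` describe the allowed descent speed at a given height, and `gain` describes how sharply the policy switches to firing. An unstructured network would have to discover that relationship from the data alone.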

Paper Structure

This paper contains 25 sections, 4 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Overview of behavior cloning with general domain knowledge. We collect domain knowledge from the expert (middle) in addition to demonstrations (top). An LLM translates this knowledge into the structure of the policy (bottom) and behavior cloning is used to learn the parameters of the policy from the demonstrations.
  • Figure 2: Illustration of the Knowledge Informed Model for the Lunar Lander environment generated by GPT. Each box represents a variable, with the white boxes representing latent variables and the gray boxes representing tunable parameters in the model. Arrows represent the dependencies between variables and each has an associated learnable weight. The oval shapes represent non-linear operations. By default the value of latent variables is a linear combination of the variables that it depends on.
  • Figure 3: Success rates in the Lunar Lander task, evaluated on $100$ random start states per session. The error bars in the plot show the $95\%$ confidence interval estimated by $20$ sets of demonstration episodes. Asterisks denote the statistical significance levels of paired t-tests (* for $< 0.05$, ** for $< 0.01$, and *** for $< 0.001$).
  • Figure 4: Reward in the Car Racing task, evaluated on $100$ random tracks per session. The error bars in the plot show the $95\%$ confidence interval estimated by $10$ sets of demonstration episodes.
  • Figure 5: Reward in the Car Racing task with different levels of action noise, each evaluated on $100$ random tracks per session. The error bars in the plot show the $95\%$ confidence interval estimated by $10$ sets of demonstration episodes. Each model is trained with $10$ demonstration episodes.
  • ...and 5 more figures