Efficiently Generating Expressive Quadruped Behaviors via Language-Guided Preference Learning

Jaden Clark, Joey Hejna, Dorsa Sadigh

TL;DR

This work introduces Language-Guided Preference Learning (LGPL), a hybrid framework that uses a large language model (LLM) to generate high-information candidate reward parameterizations $\omega$ for quadruped locomotion and then refines them via human preference rankings to infer the target $\omega^*$. By augmenting preference learning with LLM-driven priors and sub-segment trajectory comparisons, LGPL substantially improves sample efficiency, learning effective and expressive gaits with as few as four queries. In simulation and on real hardware, LGPL outperforms purely language-parameterized approaches and traditional preference learning on objective metrics (e.g., lower MSE to the ground-truth reward parameters) and in human evaluations of alignment with desired behaviors. The approach enables rapid, user-specific gait customization suitable for social and interactive robot deployment. Noted limitations relate to differentiability, prompt sensitivity, and scalability to more complex task sequences.

Abstract

Expressive robotic behavior is essential for the widespread acceptance of robots in social environments. Recent advancements in learned legged locomotion controllers have enabled more dynamic and versatile robot behaviors. However, determining the optimal behavior for interactions with different users across varied scenarios remains a challenge. Current methods either rely on natural language input, which is efficient but low-resolution, or learn from human preferences, which, although high-resolution, is sample inefficient. This paper introduces a novel approach that leverages priors generated by pre-trained LLMs alongside the precision of preference learning. Our method, termed Language-Guided Preference Learning (LGPL), uses LLMs to generate initial behavior samples, which are then refined through preference-based feedback to learn behaviors that closely align with human expectations. Our core insight is that LLMs can guide the sampling process for preference learning, leading to a substantial improvement in sample efficiency. We demonstrate that LGPL can quickly learn accurate and expressive behaviors with as few as four queries, outperforming both purely language-parameterized models and traditional preference learning approaches. Website with videos: https://lgpl-gaits.github.io/
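
The refinement step described above, inferring a target parameterization from a user's ranking of candidates, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a Plackett-Luce ranking model, a utility equal to the negative squared distance between a candidate and the estimate, and an L2 pull toward the mean of the candidates (standing in for an LLM-provided prior). The names `fit_omega`, `objective`, `reg`, and the example candidates are all hypothetical.

```python
import math


def utilities(omega_hat, candidates):
    # Assumed utility: negative squared distance to the current estimate.
    return [-sum((c - w) ** 2 for c, w in zip(cand, omega_hat))
            for cand in candidates]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def objective(omega_hat, candidates, ranking, prior, reg):
    """Plackett-Luce log-likelihood of `ranking` (best first) minus an L2 penalty."""
    u = utilities(omega_hat, candidates)
    ll = 0.0
    for k in range(len(ranking) - 1):
        rest = [u[i] for i in ranking[k:]]
        m = max(rest)
        ll += rest[0] - (m + math.log(sum(math.exp(v - m) for v in rest)))
    penalty = reg * sum((w - p) ** 2 for w, p in zip(omega_hat, prior))
    return ll - penalty


def fit_omega(candidates, ranking, reg=0.5, lr=0.05, steps=500):
    """Gradient ascent on the regularized ranking likelihood (hypothetical helper)."""
    dim = len(candidates[0])
    # Assumption: the mean of the candidates stands in for the prior.
    prior = [sum(c[d] for c in candidates) / len(candidates) for d in range(dim)]
    omega_hat = list(prior)
    for _ in range(steps):
        u = utilities(omega_hat, candidates)
        grad = [-2.0 * reg * (w - p) for w, p in zip(omega_hat, prior)]
        for k in range(len(ranking) - 1):
            rest = ranking[k:]
            probs = softmax([u[i] for i in rest])
            for d in range(dim):
                expected = sum(p * candidates[i][d] for p, i in zip(probs, rest))
                # Gradient of one Plackett-Luce term; the omega_hat terms cancel.
                grad[d] += 2.0 * (candidates[rest[0]][d] - expected)
        omega_hat = [w + lr * g for w, g in zip(omega_hat, grad)]
    return omega_hat


# Four made-up 2-D candidates; the user ranks candidate 0 best and 3 worst.
example_candidates = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4], [0.8, 0.2]]
example_ranking = [0, 1, 2, 3]
print(fit_omega(example_candidates, example_ranking))
```

The penalty term is needed in this sketch because a single consistent ranking would otherwise push the estimate without bound in the preferred direction. With it, the estimate moves from the candidate mean toward the higher-ranked candidates and stops at a finite point.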

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Our approach, LGPL, leverages LLMs to generate high quality candidate behaviors (1) for preference learning (2). This enables efficient design of diverse and accurate behaviors, even by non-expert users (3).
  • Figure 2: An overview of the LGPL method. An LLM is provided with in-context examples of task parameterizations $\omega$ that correspond to different quadruped gaits, described in language. Given a new desired behavior, the LLM produces candidate parameterizations, which are used to roll out the policy. A user then ranks these candidates, and the rankings are used for preference learning to discover the optimal $\omega$.
  • Figure 3: Results of our simulation study to evaluate query efficiency. For each task, an expert-defined ground truth $\omega^*$ was chosen and the MSE between $\omega^*$ and the learned $\omega$ is plotted.
  • Figure 4: Results for the offline user study. The % win rate is the percentage of the time users preferred LGPL.
  • Figure 5: Results for the feedback-driven user study. The % win rate is the percentage of the time users preferred that method over all 3 other methods.
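
The two quantities reported in Figures 3 to 5 can be written out as a minimal sketch. It assumes the MSE is the plain mean of squared per-parameter differences between $\omega^*$ and the learned $\omega$, and that the win rate is computed from a list of per-trial preferred-method labels. The function names and the data layout are hypothetical and are not taken from the paper.

```python
def mse(omega_true, omega_learned):
    """Mean squared error between two parameter vectors of equal length."""
    if len(omega_true) != len(omega_learned):
        raise ValueError("parameter vectors must have the same length")
    squared = [(t - l) ** 2 for t, l in zip(omega_true, omega_learned)]
    return sum(squared) / len(squared)


def win_rate(preferred, method):
    """Percentage of trials in which `method` was the preferred option."""
    if not preferred:
        raise ValueError("need at least one trial")
    wins = sum(1 for choice in preferred if choice == method)
    return 100.0 * wins / len(preferred)


# Made-up values, for illustration only.
print(mse([1.0, 2.0], [1.0, 4.0]))
print(win_rate(["LGPL", "other", "LGPL", "LGPL"], "LGPL"))
```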