Safe and Optimal Learning from Preferences via Weighted Temporal Logic with Applications in Robotics and Formula 1

Ruya Karagulle, Cristian-Ioan Vasile, Necmiye Ozay

TL;DR

This paper addresses safe learning from human feedback by encoding task specifications and preferences in Weighted Signal Temporal Logic (WSTL), which enables safety guarantees while learning the weights that best explain the observed data. It introduces two complementary techniques, structural pruning and a log-transform, that respectively reduce the problem size and linearize the optimization, allowing an exact MILP formulation. The approach is validated on a robot navigation task and on real-world Formula 1 data, showing that it can capture nuanced preferences and reveal interpretable insights into performance factors. The work advances safe, optimal, and interpretable learning from demonstrations, rankings, and comparisons, with practical implications for autonomous robotics and competitive settings such as Formula 1 racing.

Abstract

Autonomous systems increasingly rely on human feedback, expressed as pairwise comparisons, rankings, or demonstrations, to align their behavior. While existing methods can adapt behaviors, they often fail to guarantee safety in safety-critical domains. We propose a safety-guaranteed, optimal, and efficient approach to solve the learning problem from preferences, rankings, or demonstrations using Weighted Signal Temporal Logic (WSTL). WSTL learning problems, when implemented naively, lead to multi-linear constraints in the weights to be learned. By introducing structural pruning and log-transform procedures, we reduce the problem size and recast the problem as a Mixed-Integer Linear Program while preserving safety guarantees. Experiments on robotic navigation and real-world Formula 1 data demonstrate that the method effectively captures nuanced preferences and models complex task objectives.
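
To see why a log-transform can turn multi-linear weight constraints into linear ones, consider a toy constraint. The symbols $w_i$, $v_i$, $a$, and $b$ are illustrative assumptions rather than the paper's notation, and the sketch assumes every quantity is strictly positive:

$$
w_1 w_2\, a \;\ge\; w_3 w_4\, b
\quad\Longleftrightarrow\quad
v_1 + v_2 + \log a \;\ge\; v_3 + v_4 + \log b,
\qquad v_i = \log w_i .
$$

The left-hand form is bilinear in the unknown weights, while the right-hand form is linear in the transformed unknowns $v_i$. The equivalence holds because $\log$ is strictly increasing on the positive reals.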

Paper Structure

This paper contains 13 sections, 3 theorems, 7 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Given a signal $\sigma$ and a formula $\phi$, structural pruning preserves the quantitative semantics of STL formulas. That is, the robustness value computed from the pruned RCT is $\rho(\sigma,\phi,t)$.
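
As a rough illustration of the robustness value $\rho(\sigma,\phi,t)$ that the theorem refers to, the sketch below evaluates the Figure 1 formula $\phi = \Diamond_{[3,5]}(0\leq \sigma \leq 5)$ on the Figure 1 signal using the standard (unweighted) STL quantitative semantics. The function names are hypothetical, and the code does not reproduce the paper's RCT construction or pruning procedure; it only computes the value that, per the theorem, the pruned RCT should also return.

```python
def predicate_robustness(value, low=0.0, high=5.0):
    """Robustness of (low <= value <= high): min of the two margin terms."""
    return min(value - low, high - value)


def eventually_robustness(signal, start, end, t=0):
    """Robustness of 'eventually in [start, end]' at time t: max over the window."""
    return max(
        predicate_robustness(signal[t + k]) for k in range(start, end + 1)
    )


# Signal from Figure 1, sampled at t = 0, 1, ..., 5.
signal = [5, 6, 7, -1, 4, 2]

# Window [3, 5] covers the samples -1, 4, 2 with predicate robustness -1, 1, 2.
rho = eventually_robustness(signal, 3, 5)
print(rho)  # 2.0
```

The positive result indicates that the signal satisfies $\phi$ at $t = 0$, with the sample at $t = 5$ determining the value.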

Figures (3)

  • Figure 1: Trees associated with $\phi = \Diamond_{[3,5]}(0\leq \sigma \leq 5)$; AST, RCT, and pruned RCT for signal $\sigma = [5,6,7,-1,4,2]$, respectively.
  • Figure 2: Trajectories generated using three different preference sets. PD1 is the original preference set, PD2 is obtained by flipping the answer to a single pair in PD1, and PD3 is obtained by reversing all answers in PD1.
  • Figure 3: The evolution of final standing predictions and the prediction accuracy over the laps at the 2025 Monza Grand Prix. The right block shows the correct standing when DNFs are included and excluded. Each team is represented with a separate color, and drivers from the same team are represented with solid and dashed lines. For readability purposes, driver abbreviations are kept instead of enumerations.

Theorems & Definitions (7)

  • Example 1
  • Theorem 1
  • Example 1: continued
  • Theorem 2
  • proof
  • Theorem 3
  • proof