RESTL: Reinforcement Learning Guided by Multi-Aspect Rewards for Signal Temporal Logic Transformation
Yue Fang, Jin Zhi, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan
TL;DR
RESTL addresses the NL-to-STL transformation problem for cyber-physical systems by combining reinforcement learning with multi-aspect reward models. It introduces four signals—Atomic Proposition Alignment, Templated NL Similarity, Formula Succinctness, and STL-level Similarity—trained via curriculum learning and aggregated to guide an STL generator with PPO under KL regularization. The approach yields substantial improvements over state-of-the-art baselines on two NL-to-STL benchmarks in both automatic metrics and human judgments, and ablations confirm the value of each reward component and the curriculum strategy. This framework advances practical automatic translation of natural language requirements into precise STL specifications, with potential impact on CPS design, verification, and runtime monitoring.
Abstract
Signal Temporal Logic (STL) is a powerful formal language for specifying real-time specifications of Cyber-Physical Systems (CPS). Transforming specifications written in natural language into STL formulas automatically has attracted increasing attention. Existing rule-based methods depend heavily on rigid pattern matching and domain-specific knowledge, limiting their generalizability and scalability. Recently, Supervised Fine-Tuning (SFT) of large language models (LLMs) has been successfully applied to transform natural language into STL. However, the lack of fine-grained supervision on atomic proposition correctness, semantic fidelity, and formula readability often leads SFT-based methods to produce formulas misaligned with the intended meaning. To address these issues, we propose RESTL, a reinforcement learning (RL)-based framework for the transformation from natural language to STL. RESTL introduces multiple independently trained reward models that provide fine-grained, multi-faceted feedback from four perspectives, i.e., atomic proposition consistency, semantic alignment, formula succinctness, and symbol matching. These reward models are trained with a curriculum learning strategy to improve their feedback accuracy, and their outputs are aggregated into a unified signal that guides the optimization of the STL generator via Proximal Policy Optimization (PPO). Experimental results demonstrate that RESTL significantly outperforms state-of-the-art methods in both automatic metrics and human evaluations.
