GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation
William English, Chase Walker, Dominic Simon, Rickard Ewetz
TL;DR
GinSign tackles the critical gap in NL-to-TL translation by grounding all atomic propositions to a formal system signature, enabling executable TL and downstream model checking. It introduces a hierarchical grounding framework that first selects a predicate and then grounds typed arguments using a prefix-based BERT classifier, dramatically reducing label space and avoiding reliance on large LLMs. Across VLTL-Bench domains, GinSign achieves perfect predicate grounding, strong argument grounding, and high end-to-end grounded translation accuracy, outperforming prompting baselines and enabling semantic verification. The approach demonstrates robust generalization to unseen domains and terminology, highlighting the practical potential for grounded NL-to-TL in trustworthy autonomous systems.
Abstract
Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications-an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically- first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of $95.5\%$, a $1.4\times$ improvement over SOTA.
