Cook2LTL: Translating Cooking Recipes to LTL Formulae using Large Language Models

Angelos Mavrogiannis, Christoforos Mavrogiannis, Yiannis Aloimonos

TL;DR

Cook2LTL presents a framework to translate free-form cooking recipes into robot-executable temporal logic by grounding high-level actions to a primitive action set and caching deconstructed actions in a dynamic library. The approach leverages a semantic parser trained on Recipe1M+ data and LLM-based action reduction to map actions to primitives, then translates sequences into LTL formulae such as $F(\psi_1 \wedge F(\psi_2 \wedge \dots))$ to capture temporal order. An ablation study demonstrates substantial reductions in API calls, latency, and cost when using the action library, while maintaining high executability across recipes. Demonstrations in AI2-THOR confirm the method's potential for sim-to-real transfer, though the results highlight sensitivity to initial LLM outputs and the need for robust execution feedback and more extensive datasets. The work advances practical, temporally-aware task planning for cooking in robotics by marrying semantic parsing, few-shot prompting, and formal temporal logic grounding.
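The nested "eventually" pattern $F(\psi_1 \wedge F(\psi_2 \wedge \dots))$ can be illustrated with a short sketch. The helper below is not taken from Cook2LTL: the function name `sequence_to_ltl` and the ASCII rendering of the operators (`F` for "eventually", `&` for conjunction) are assumptions made for this example.

```python
from typing import List


def sequence_to_ltl(props: List[str]) -> str:
    """Nest ordered propositions as F(p1 & F(p2 & ... F(pn)))."""
    if not props:
        raise ValueError("need at least one proposition")
    # Start from the innermost (last) step and wrap outward.
    formula = f"F({props[-1]})"
    for prop in reversed(props[:-1]):
        formula = f"F({prop} & {formula})"
    return formula


formula = sequence_to_ltl(["psi_1", "psi_2", "psi_3"])
print(formula)
```

Building the formula from the last step outward keeps each proposition inside the scope of the preceding "eventually", which is what encodes the ordering of the steps.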

Abstract

Cooking recipes are challenging to translate to robot plans as they feature rich linguistic complexity, temporally-extended interconnected tasks, and an almost infinite space of possible actions. Our key insight is that combining a source of cooking domain knowledge with a formalism that captures the temporal richness of cooking recipes could enable the extraction of unambiguous, robot-executable plans. In this work, we use Linear Temporal Logic (LTL) as a formal language expressive enough to model the temporal nature of cooking recipes. Leveraging a pretrained Large Language Model (LLM), we present Cook2LTL, a system that translates instruction steps from an arbitrary cooking recipe found on the internet to a set of LTL formulae, grounding high-level cooking actions to a set of primitive actions that are executable by a manipulator in a kitchen environment. Cook2LTL makes use of a caching scheme that dynamically builds a queryable action library at runtime. We instantiate Cook2LTL in a realistic simulation environment (AI2-THOR), and evaluate its performance across a series of cooking recipes. We demonstrate that our system significantly decreases LLM API calls (-51%), latency (-59%), and cost (-42%) compared to a baseline that queries the LLM for every newly encountered action at runtime.
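The caching idea in the abstract can be sketched as a small lookup structure. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `ActionLibrary`, the function `fake_reducer` (a hard-coded stand-in for the LLM query), and the primitive names are all hypothetical.

```python
from typing import Callable, Dict, List


class ActionLibrary:
    """Hypothetical cache mapping high-level actions to primitive sequences."""

    def __init__(self, primitives: List[str], reducer: Callable[[str], List[str]]):
        self.primitives = set(primitives)
        self.reducer = reducer  # stand-in for the LLM query
        self.cache: Dict[str, List[str]] = {}
        self.llm_calls = 0

    def resolve(self, action: str) -> List[str]:
        # Primitive actions need no reduction.
        if action in self.primitives:
            return [action]
        # Query the reducer only on a cache miss, then store the result.
        if action not in self.cache:
            self.llm_calls += 1
            self.cache[action] = self.reducer(action)
        return list(self.cache[action])


def fake_reducer(action: str) -> List[str]:
    # Hard-coded reply standing in for an LLM; the primitive names are invented.
    table = {"refrigerate": ["open", "pick_up", "put", "close"]}
    return table[action]


library = ActionLibrary(["open", "pick_up", "put", "close"], fake_reducer)
first = library.resolve("refrigerate")   # cache miss: one reducer call
second = library.resolve("refrigerate")  # cache hit: no further call
```

In this sketch the second lookup of `refrigerate` is served from the cache, which is the mechanism by which a library of this kind would reduce API calls for repeated actions.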

Paper Structure (16 sections, 6 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Cook2LTL in AI2-THOR (Kolve et al., 2017): The robot is given the instruction "Refrigerate the apple". Cook2LTL produces an initial LTL formula $\phi$ (top left); then it queries an LLM to retrieve the low-level admissible primitives for executing the action; finally it generates a formula consisting of 4 atomic propositions ($\psi_1,\psi_2,\psi_3,\psi_4$) that provide the required task specification and yield these consecutive scenes.
  • Figure 2: Cook2LTL System: The input instruction $r_i$ is first preprocessed and then passed to the semantic parser, which extracts meaningful chunks corresponding to the categories $\mathcal{C}$ and constructs a function representation $\mathtt{a}$ for each detected action. If $\mathtt{a}$ is part of the action library $\mathbb{A}$, then the LTL translator infers the final LTL formula $\phi$. Otherwise, the action is reduced to a sequence of lower-level admissible actions $\{a_1,a_2,\dots,a_k\}$ from $\mathcal{A}$, and the reduction policy is cached to $\mathbb{A}$ for future use. The LTL translator then yields the final LTL formulae based on the derived actions.
  • Figure 3: We annotate Recipe1M+ (Marin et al., 2019) instruction steps with the salient categories $\mathcal{C}=\{\text{Verb}, \text{What?}, \text{Where?}, \text{How?}, \text{Temperature}, \text{Time}\}$ and fine-tune a named entity recognizer to segment chunks corresponding to $\mathcal{C}$.
  • Figure 4: Inspired by ProgPrompt (Singh et al., 2022), Cook2LTL uses an LLM prompting scheme to reduce a high-level cooking action (e.g., boil eggs) to a series of primitive manipulation actions. The prompt consists of an import statement of the primitive action set and example function definitions of similar cooking tasks. The key benefit of using this paradigm is that it constrains the output action plan of the LLM to only include subsets of the available primitive actions. We extend this prompting scheme by reusing derived LLM policies. In this case, the action boil is added to future import statements in the input prompt, enabling the model to invoke the derived boil function, which is now considered given to the system.
  • Figure 5: Tasks on which we tested Cook2LTL in AI2-THOR (left to right): microwave the potato; chop the tomato; cut the bread; refrigerate the apple.
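
The prompting scheme described in the Figure 4 caption (an import statement listing the available actions, example function definitions, and previously derived actions appended to later imports) can be sketched as a string builder. This is an assumption-laden illustration rather than the paper's prompt: the module name `actions`, the function `build_prompt`, the primitive names, and the example body are all hypothetical.

```python
from typing import Iterable, List


def build_prompt(primitives: List[str], learned: Iterable[str],
                 examples: List[str], target: str) -> str:
    """Assemble a code-style prompt ending in an open function header."""
    # Previously derived actions are appended to the import list so the
    # model may call them as if they were primitives.
    names = list(primitives) + sorted(set(learned) - set(primitives))
    header = "from actions import " + ", ".join(names)
    parts = [header] + examples + [f"def {target}():"]
    return "\n\n".join(parts)


# Invented primitive set and example definition, for illustration only.
PRIMITIVES = ["pick_up", "put", "turn_on", "turn_off"]
EXAMPLE = (
    "def microwave_potato():\n"
    "    pick_up('potato')\n"
    "    put('potato', 'microwave')\n"
    "    turn_on('microwave')"
)

before = build_prompt(PRIMITIVES, [], [EXAMPLE], "boil_eggs")
after = build_prompt(PRIMITIVES, ["boil"], [EXAMPLE], "poach_eggs")
```

Here `after` differs from `before` in that `boil` appears in the import line, mirroring the caption's description of a derived action being treated as given in later prompts.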