CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans

Yash Kumar Lal; Vanya Cohen; Nathanael Chambers; Niranjan Balasubramanian; Raymond Mooney

TL;DR

CaT-Bench is a targeted benchmark for causal and temporal reasoning over natural-language plans. It tests whether LLMs can determine if one step must precede another in cooking recipes. Built on the Recipe Flow Graph Corpus, it yields 2,840 binary questions across 57 plans, plus a companion explanation task evaluated by humans. SOTA LLMs struggle to predict step dependencies in the zero-shot setting (best F1 of 0.59), and although prompting for explanations improves accuracy (best F1 of about 0.73), human judges on average do not agree with model reasoning, and explanations can be brittle or biased. The study also finds that chain-of-thought prompting does not universally outperform answer-then-explain prompting, and that model answers are significantly inconsistent across questions and models, leaving substantial room for improvement in procedural understanding and trustworthy reasoning.

Abstract

Understanding the abilities of LLMs to reason about natural language plans, such as instructional text and recipes, is critical to reliably using them in decision-making systems. A fundamental aspect of plans is the temporal order in which their steps need to be executed, which reflects the underlying causal dependencies between them. We introduce CaT-Bench, a benchmark of Step Order Prediction questions, which test whether a step must necessarily occur before or after another in cooking recipe plans. We use this to evaluate how well frontier LLMs understand causal and temporal dependencies. We find that SOTA LLMs are underwhelming (best zero-shot is only 0.59 in F1), and are biased towards predicting dependence more often, perhaps relying on temporal order of steps as a heuristic. While prompting for explanations and using few-shot examples improve performance, the best F1 result is only 0.73. Further, human evaluation of explanations along with answer correctness shows that, on average, humans do not agree with model reasoning. Surprisingly, we also find that explaining after answering leads to better performance than normal chain-of-thought prompting, and LLM answers are not consistent across questions about the same step pairs. Overall, results show that LLMs' ability to detect dependence between steps has significant room for improvement.
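
As an illustration of the task described above, the following is a minimal sketch of how binary Step Order Prediction questions could be derived from step-pair dependency annotations and scored. It is not the authors' pipeline. The toy plan, the question wording, the use of a transitive closure to mark multi-hop dependence, the positive-class F1, and the `pair_consistency` measure are all assumptions made for this example; the paper's exact templates, F1 averaging, and consistency metric are not stated in this summary.

```python
from itertools import combinations


def transitive_closure(edges):
    """Return all (i, j) pairs where step j depends on step i, directly or via other steps."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def build_questions(num_steps, edges):
    """Create a 'before' and an 'after' question for every ordered step pair (i < j)."""
    deps = transitive_closure(edges)
    questions = []
    for i, j in combinations(range(1, num_steps + 1), 2):
        label = (i, j) in deps  # True means step i must precede step j
        # Hypothetical question wording, used only for illustration.
        questions.append((f"Must Step {i} happen before Step {j}?", label))
        questions.append((f"Must Step {j} happen after Step {i}?", label))
    return questions


def f1_score(gold, pred):
    """F1 of the positive ('dependent') class for binary answers."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pair_consistency(pred):
    """Fraction of step pairs whose 'before' and 'after' answers agree."""
    pairs = list(zip(pred[0::2], pred[1::2]))
    return sum(a == b for a, b in pairs) / len(pairs)


# Toy 4-step plan: step 3 needs step 1, and step 4 needs step 3.
questions = build_questions(4, {(1, 3), (3, 4)})
gold = [label for _, label in questions]

# A baseline that always predicts dependence, mimicking the bias noted in the abstract.
always_yes = [True] * len(gold)
baseline_f1 = f1_score(gold, always_yes)
baseline_consistency = pair_consistency(always_yes)
```

In this toy plan, 6 of the 12 questions are positive, so the always-yes baseline has precision 0.5 and recall 1.0. It is also perfectly consistent across the two phrasings, which shows why consistency alone does not indicate correct reasoning.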

Paper Structure (40 sections, 12 figures, 11 tables)

Introduction
Related Work
CaT-Bench
Automatic Metrics
Human Evaluation
Benchmarking Models on CaT-Bench
Models
How Good Are Model Predictions?
Models struggle at predicting step order.
Generating explanations improves performance.
Models are biased towards predicting dependence.
How Good Are Model Explanations?
Analysis
Robustness of Models
Temporal Consistency
...and 25 more sections

Figures (12)

Figure 1: We use step-pair dependency annotations to create CaT-Bench, a question-driven evaluation framework for plan-based reasoning. Questions in this benchmark elicit reasoning about different causal relations such as preconditions, effects and step independence.
Figure 2: Examples of different types of questions in a plan from CaT-Bench. To correctly answer these questions, one must understand preconditions and effects (to answer Dep), and recognize that some steps need not be performed in any particular order and that plans can contain subplans within them (to answer NonDep).
Figure 3: Since two steps that are not dependent on each other can be performed in any order, we swap their order in the plan and ask binary questions about them similar to NonDep. Note that, while the plan itself is altered, the question remains the same.
Figure 4: Example of hallucinations produced by GPT-4 in the (E + A) setting.
Figure 5: Examples of cases where GPT-4 comes up with good (upper box) and bad (lower box) answers. This error is of the multi-hop dependency type. To make shortcakes, removing the cake from the oven (Step 10) is dependent on baking the cake (Step 9), which in turn is dependent on combining the ingredients (Step 2). Examples of other error types can be found in \ref{fig:lemon-cat-examples}.
...and 7 more figures
